Capsone Project with Spinnaker Analytics

Team A6: Yihan Jia, Yinghao Wang, Simeng Li, Xiangshan Mu, Kaiyu Wang

Business Objectives:

  • The objective of the exercise is to find a tradable signal in the dataset.

Project Flow:

  • Standard time series models can be used or non-parametric models like neural networks can be deployed. Suggested course of actions is as follows:
    1. Understand the dataset and what it represents
    2. Define the problem: what does 'a tradable signal' mean based on the data available (i.e. develop your modeling equation, if any)
    3. Start with simple analysis to identify patterns and outliers
    4. Develop models with increasing complexity - i.e. don't start with neural nets etc
    5. Define your success criteria - i.e. when do you know you have a tradable signal - positive (buy)/negative (sell), both are signals

Library Imports

In [ ]:
# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt

from google.colab import drive

from statsmodels.tsa.stattools import adfuller

from statsmodels.tsa.api import ExponentialSmoothing

from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

from statsmodels.tsa.arima_model import ARIMA



from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree, neural_network
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import  AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

#reset recursionlimit
import sys
sys.setrecursionlimit(10000)

import torch #pytorch
import torch.nn as nn
from torch.autograd import Variable

# supress filter warnings
import warnings
warnings.filterwarnings("ignore")

#plt style
plt.style.use("seaborn")
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

Data Source

The dataset contains sectoral data for 3 separate types of investments made in the US (no international funds are included) and represents 60-70% of activity in the overall asset classes for that week (not all investors report data at the same time). The weekly data spans 10 years from 2006 through end-Jan 2017. Not all sectors have data available for all the dates since new investment vehicles are introduced at various points in time.

  1. Institutional Mutual Fund Holdings (investments made/redeemed by institutional investors like Fidelity, Vanguard on behalf of insitutions like CALPERS etc)
  2. Retail Mutual Fund Holdings (investments made by individuals in their portfolios)
  3. Exchange Traded Funds - insitutional investors

Data fields & descriptions:

  • ReportDate: Weekly data aggregated and released every Wednesday
  • AssetClass: Industry/Sector/Asset Class
  • Flow: Amount of positive (inflow) or negative (outflow) in Millions of USD
  • FlowPct: Flows as percent of assets at beginning of the week
  • AssetsEnd: Assets at end of the week in Millions of USD
  • PortfolioChangePct: Percent change in overall portfolio during the week

Data Imports

In [ ]:
# read datasets
# try:
#   # mount my google drive
#   drive.mount('/content/drive')
#   # change directory to the path that contains dataset uploaded
#   %cd /content/drive/My Drive/Capstone Project

#   # read through google drive
#   IMF = pd.read_csv('US Sector Inst ETF.csv')
#   ETF = pd.read_csv('US Sector Inst ETF.csv')
#   RMF  = pd.read_csv('US Sector Retail MF.csv')

# except: 
  # or read through github
ETF = pd.read_csv('https://raw.githubusercontent.com/yinghaow525/MSBA-Projects/main/Capstone%20Project/Datasets/US%20Sector%20Inst%20ETF.csv')
IMF = pd.read_csv('https://raw.githubusercontent.com/yinghaow525/MSBA-Projects/main/Capstone%20Project/Datasets/US%20Sector%20Institutional%20MF.csv')
RMF = pd.read_csv('https://raw.githubusercontent.com/yinghaow525/MSBA-Projects/main/Capstone%20Project/Datasets/US%20Sector%20Retail%20MF.csv')
In [ ]:
print(IMF.head(3))
print('-'*50)
print(ETF.head(3))
print('-'*50)
print(RMF.head(3))
             ReportDate                                         AssetClass  \
0  2/1/2017 12:00:00 AM  Commodities/Materials-North America-USA-North ...   
1  2/1/2017 12:00:00 AM  Consumer Goods-North America-USA-North America...   
2  2/1/2017 12:00:00 AM      Energy-North America-USA-North America-Equity   

         Flow  FlowPct     AssetsEnd  PortfolioChangePct  
0  380.126769   4.4676   8787.631717             -1.1853  
1  332.526792   1.1479  28973.613065             -1.1231  
2  217.857886   0.3460  62136.008339             -1.6441  
--------------------------------------------------
             ReportDate                                         AssetClass  \
0  2/1/2017 12:00:00 AM  Commodities/Materials-North America-USA-North ...   
1  2/1/2017 12:00:00 AM  Consumer Goods-North America-USA-North America...   
2  2/1/2017 12:00:00 AM      Energy-North America-USA-North America-Equity   

         Flow  FlowPct     AssetsEnd  PortfolioChangePct  
0  378.578706   4.5064   8679.056347             -1.1938  
1  332.526792   1.1479  28973.613065             -1.1231  
2  186.031374   0.3782  48446.700077             -1.8855  
--------------------------------------------------
             ReportDate                                         AssetClass  \
0  2/1/2017 12:00:00 AM  Commodities/Materials-North America-USA-North ...   
1  2/1/2017 12:00:00 AM  Consumer Goods-North America-USA-North America...   
2  2/1/2017 12:00:00 AM      Energy-North America-USA-North America-Equity   

        Flow  FlowPct     AssetsEnd  PortfolioChangePct  
0  25.681524   8.0046    344.521230             -0.6207  
1  -3.951259  -1.1736    329.044261             -1.0916  
2  73.933969   0.2713  27246.540015             -0.2297  

Concating & Preprocessing

In [ ]:
# Create type colunm
ETF['Type'] = 'ETF'
IMF['Type'] = 'IMF'
RMF['Type'] = 'RMF'

# Concat 3 dadaset
data = pd.concat([ETF, IMF, RMF], axis=0)
In [ ]:
# check NAs
print(data.isna().sum())
ReportDate            0
AssetClass            0
Flow                  0
FlowPct               0
AssetsEnd             0
PortfolioChangePct    0
Type                  0
dtype: int64
In [ ]:
# check duplicates
data.duplicated().sum()
Out[ ]:
0
In [ ]:
# convert report date to pandas datetime object
data['ReportDate'] = pd.to_datetime(data['ReportDate'])
In [ ]:
# since all the Asset and Sector values are the same (Equity, and within US)
# we only keep Industry sector as features
data['AssetClass'] = data['AssetClass'].str.split('-')
data['Industry'] = data.AssetClass.apply(lambda x: x[0])

# drop orignal AssetClass feature
data.drop(columns='AssetClass', inplace=True)

# types and number of Industry values
data.Industry.value_counts()
Out[ ]:
Mid Cap Growth           1737
Technology               1737
Energy                   1737
Financials               1737
Health Care/Biotech      1737
Utilities                1737
Large Cap Blend          1737
Large Cap Growth         1737
Large Cap Value          1737
Mid Cap Blend            1737
Telecom                  1737
Mid Cap Value            1737
Real Estate              1737
Small Cap Blend          1737
Small Cap Growth         1737
Small Cap Value          1737
Commodities/Materials    1700
Consumer Goods           1700
Industrials              1435
Infrastructure             61
Name: Industry, dtype: int64
In [ ]:
# extract year & month 
data['year'], data['month'] = data['ReportDate'].dt.year, data['ReportDate'].dt.month
data.head()
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry year month
0 2017-02-01 378.578706 4.5064 8679.056347 -1.1938 ETF Commodities/Materials 2017 2
1 2017-02-01 332.526792 1.1479 28973.613065 -1.1231 ETF Consumer Goods 2017 2
2 2017-02-01 186.031374 0.3782 48446.700077 -1.8855 ETF Energy 2017 2
3 2017-02-01 968.885223 1.5139 64378.816729 -0.9202 ETF Financials 2017 2
4 2017-02-01 410.686628 1.1532 36846.852743 2.3119 ETF Health Care/Biotech 2017 2

Exploratory Data Analysis & Statistical Analysis

In [ ]:
test = data.loc[(data.Type=='ETF') & (data.Industry == 'Energy')][['Flow','ReportDate']] 
plt.figure(figsize=(40,3))
plt.plot(test.ReportDate, test.Flow)
plt.title('Fund Flow of Energy Industry with ETF Type')
plt.show()

Exploritory Data Analysis

In [ ]:
# max & min value for Flow
print('Max value of Flow is {};'.format(data.Flow.max()), '\n'
      'Min value of Flow is {};'.format(data.Flow.min()),'\n'
      'Avg value of Flow is {}.'.format(data.Flow.mean()))
Max value of Flow is 23263.40489835; 
Min value of Flow is -13967.24062477; 
Avg value of Flow is 20.653445520794538.
In [ ]:
# distribution of entire PortfolioChangePct
plt.figure(figsize=(8,5))
sns.distplot(data.PortfolioChangePct, color='lightblue', bins=30, rug=True)
sns.distplot(data.PortfolioChangePct, hist=False)
plt.title("Distribution of PortfolioChangePct")
plt.show()
In [ ]:
# plotting PortfolioChangePct distribution by industry on boxplot, we can do via a loop
for _ in data.Industry.value_counts().index:
  plt.figure(figsize=(6,4))
  sns.boxplot(data.loc[data.Industry == _].PortfolioChangePct)
  plt.title(_+' Industry PortfolioChangePct')
  plt.show()
In [ ]:
# ... or agg into a single plot
  data.boxplot(column = 'PortfolioChangePct', by = 'Industry', figsize=(15,10), 
               rot = 45, fontsize = 10, patch_artist = True)
  plt.show()


  ###  outliers: >= |15| ???
In [ ]:
# ... or agg into a single plot
  # wide version

  data.boxplot(column = 'PortfolioChangePct', by = 'Industry', figsize=(80,20), 
               rot = 45, fontsize = 16, patch_artist = True)
  plt.show()

  ###  outliers: >= |15| ???
In [ ]:
# group by education
groupby_type = data.groupby('Type').mean()
groupby_type.reset_index(inplace=True)
groupby_type
Out[ ]:
Type Flow FlowPct AssetsEnd PortfolioChangePct year month
0 ETF 62.592954 0.266612 30641.955796 0.172921 2011.052349 6.485958
1 IMF 46.247082 0.289701 59240.805803 0.166809 2011.059090 6.487610
2 RMF -49.033985 0.381269 46085.848006 0.148504 2011.182133 6.503562
In [ ]:
sns.barplot(groupby_type.Type, groupby_type.Flow, data=groupby_type, palette=["royalblue","cornflowerblue","lightsteelblue"])
plt.title('Average Flow of Different Types of Investments')
plt.show()
In [ ]:
groupby_month = data.groupby('month').mean()
groupby_month.reset_index(inplace=True)
groupby_month
Out[ ]:
month Flow FlowPct AssetsEnd PortfolioChangePct year
0 1 -32.475495 0.139309 45333.427745 -0.109616 2011.479578
1 2 -33.873608 0.395732 43039.036434 0.177028 2011.197145
2 3 44.125802 0.640563 43228.952366 0.583817 2011.021860
3 4 -19.574994 0.463890 44889.925190 0.644990 2011.086506
4 5 -27.586862 0.049561 44162.754668 -0.190326 2010.924926
5 6 33.833469 0.098862 45035.580521 -0.223289 2011.133585
6 7 46.369918 0.374636 46066.357727 0.483851 2011.127760
7 8 -7.249800 0.197461 45288.663950 -0.154638 2010.990221
8 9 57.010341 0.393784 45663.886324 0.144342 2011.060309
9 10 6.533923 0.000688 45860.972009 -0.109039 2011.015124
10 11 79.609968 0.708036 46950.475156 0.317066 2011.043103
11 12 103.393917 0.310243 48147.746039 0.409564 2011.058675
In [ ]:
sns.barplot(groupby_month.month,groupby_month.Flow, data=groupby_month, color="cornflowerblue")
plt.title('Average Flow of Different Months')
plt.show()
In [ ]:
groupby_month = data.groupby('year').mean()
groupby_month.reset_index(inplace=True)
groupby_month
Out[ ]:
year Flow FlowPct AssetsEnd PortfolioChangePct month
0 2006 9.709595 0.167481 20774.625815 0.294400 6.563777
1 2007 24.347559 0.388225 28292.320694 0.191274 6.461538
2 2008 33.632580 0.338836 27675.884230 -0.842377 6.528302
3 2009 -16.307381 0.410635 23518.159685 0.604485 6.615385
4 2010 6.833270 0.249576 30376.103635 0.361180 6.576923
5 2011 10.570226 0.201189 37835.369459 0.002048 6.555936
6 2012 7.170200 0.116301 43563.443944 0.281452 6.480769
7 2013 64.293285 0.273688 54932.094255 0.497152 6.442308
8 2014 60.672181 0.073182 68550.205126 0.213110 6.532893
9 2015 -22.177792 0.373536 77872.282476 -0.055566 6.615385
10 2016 40.293112 0.710150 77917.702985 0.246081 6.573231
11 2017 76.595077 1.517722 87131.860941 0.290206 1.200000
In [ ]:
sns.barplot(groupby_month.year,groupby_month.Flow, data=groupby_month, color="cornflowerblue")
plt.title('Average Flow of Different Years')
plt.show()
In [ ]:
groupby_indus = data.groupby('Industry').mean()
groupby_indus = groupby_indus[['Flow','FlowPct','PortfolioChangePct','AssetsEnd']]
groupby_indus.T
Out[ ]:
Industry Commodities/Materials Consumer Goods Energy Financials Health Care/Biotech Industrials Infrastructure Large Cap Blend Large Cap Growth Large Cap Value Mid Cap Blend Mid Cap Growth Mid Cap Value Real Estate Small Cap Blend Small Cap Growth Small Cap Value Technology Telecom Utilities
Flow 4.618448 14.705494 56.642795 43.174948 18.434165 14.458547 -0.185774 234.353609 -56.994477 18.802714 37.528139 -33.864993 0.415125 29.382265 19.458505 -24.579755 1.050384 12.797373 0.941891 0.276796
FlowPct 0.527914 0.706988 0.327309 0.468821 0.104273 0.370827 14.146915 0.095324 -0.007993 0.083720 0.076739 0.000748 0.007356 0.187202 0.188832 -0.034364 0.036308 0.102159 2.160087 0.059619
PortfolioChangePct 0.146966 0.169154 0.117991 0.076565 0.212619 0.198256 0.246716 0.150790 0.177993 0.144915 0.175615 0.171738 0.171795 0.169534 0.185696 0.187447 0.183333 0.200060 0.121771 0.135717
AssetsEnd 2845.521245 9659.567516 19386.442725 16082.303532 13996.433462 4459.504929 12.503560 286876.680400 120252.248497 131580.595345 42831.562165 38112.991915 30357.028293 23929.996101 45352.048328 27098.685214 19300.928749 14354.973156 670.525253 6773.621061
In [ ]:
fig=sns.heatmap(groupby_indus.corr(), cmap="Blues")
fig
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc4018987d0>
In [ ]:
list_ind=['Flow','FlowPct','PortfolioChangePct','AssetsEnd']
for i in list_ind:
  groupby_indus = data.groupby('Industry').mean()
  groupby_indus = groupby_indus[[i]]
  groupby_indus.plot(figsize=(16,8))
  scale_ls = range(19)
  index_ls = ['Commodities/Materials','Consumer Goods','Energy','Financials','Health Care/Biotech','Industrials','Infrastructure','Large Cap Blend','Large Cap Growth','Large Cap Value','Mid Cap Blend','Mid Cap Growth','Mid Cap Value','Real Estate','Small Cap Blend','Small Cap Growth','Small Cap Value','Technology','Telecom','Utilities']
  plt.xticks(scale_ls,index_ls,rotation=45)
  plt.title('Average ' + i + ' of different industries')
  plt.xlabel('industries')
  plt.ylabel(i)
  plt.legend()
  plt.show()

Discountinuity Check

We find four industies have less data records than others, these four industries are: Commodities/Materials, Consumer Goods, Industrials, and Infrastructure.

We want to understand the missing records in 2 ways:

  1. whether certain type have less records than the others;
  2. if so, we want to know where the record missing is: at the front of the timeline, at the back, in the middle or with not pattern.

For Commodities/Materials:

In [ ]:
# checking missing data for Commodities/Materials, and RMF type has less records
data.loc[data.Industry == 'Commodities/Materials'].Type.value_counts()
Out[ ]:
ETF    579
IMF    579
RMF    542
Name: Type, dtype: int64
In [ ]:
# and we figure out that Commodities/Materials has missing RMF records at the beginning of entire timeline 
print('The most recent ReportDate for RMF type Commodities/Materials industry is {};'.format(data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'RMF')].ReportDate.max()), '\n'
      'The least recent ReportDate for RMF type Commodities/Materials industry is {}.'.format(data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'RMF')].ReportDate.min()),'\n'
      'Compare with other typed while same industry funds, their most recent ReportDate is {};'.format(data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'IMF')].ReportDate.max()), '\n'
      'and their least recent ReportDate is {}.'.format(data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'IMF')].ReportDate.min())
                                                 )
The most recent ReportDate for RMF type Commodities/Materials industry is 2017-02-01 00:00:00; 
The least recent ReportDate for RMF type Commodities/Materials industry is 2006-09-20 00:00:00. 
Compare with other typed while same industry funds, their most recent ReportDate is 2017-02-01 00:00:00; 
and their least recent ReportDate is 2006-01-04 00:00:00.
In [ ]:
from datetime import datetime
def week_num(start_time, end_time):
  # week_start = datetime.strptime(start_time, '%Y-%m-%d')
  # week_end = datetime.strptime(end_time, '%Y-%m-%d')
  year_week_num = 52
  week_end_year = end_time.year
  week_start_year = start_time.year
  week_end_num = int(datetime.strftime(end_time, '%W'))
  week_start_num = int(datetime.strftime(start_time, '%W'))

  diff = (week_end_year - week_start_year) * year_week_num + week_end_num - week_start_num
  return diff
In [ ]:
# calculate that different in weeks for two type's start time is equals to number of missing records
# no other missing data otherwise, no discontinuity exists
week_num(data.loc[(data.Industry == 'Commodities/Materials') & (data.Type == 'IMF')].ReportDate.min(), # start time
         data.loc[(data.Industry == 'Commodities/Materials') & (data.Type == 'RMF')].ReportDate.min(), # end time
)
Out[ ]:
37
In [ ]:
# Commodities/Materials Flow plot for 3 types:

# RMF type has a later start than the other two, and the range is smaller
plt.plot(data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'RMF')].ReportDate, 
         data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'RMF')].Flow,
         label = 'RMF')
plt.title('Flow Plot for Commodities/Materials in RMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# IMF type generally stay within range -500 ~ 500, while has a large outflow reached about -2000 in 2014 May
plt.plot(data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'IMF')].ReportDate, 
         data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'IMF')].Flow,
         label = 'IMF')
plt.title('Flow Plot for Commodities/Materials in IMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# RMF type is very similar to that of IMF
# the similarity exists in trend, and value amount of flow
plt.plot(data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'ETF')].ReportDate, 
         data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'ETF')].Flow,
         label = 'ETF')
plt.title('Flow Plot for Commodities/Materials in ETF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# flow amount for IMF
data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'IMF')].Flow
Out[ ]:
0        380.126769
20       250.977996
40       -92.352798
60       169.136975
80      -255.592884
            ...    
10922      0.000000
10941    -34.945224
10960     -2.269946
10979     11.561203
10998     23.148567
Name: Flow, Length: 579, dtype: float64
In [ ]:
# flow amount for ETF
data.loc[(data.Industry == 'Commodities/Materials') & 
                                                 (data.Type == 'ETF')].Flow
Out[ ]:
0        378.578706
19       250.733560
38       -92.853355
57       168.711676
76      -254.535873
            ...    
10908      0.000000
10927    -34.945224
10946     -2.269946
10965     11.561203
10984     23.148567
Name: Flow, Length: 579, dtype: float64

For Consumer Goods:

In [ ]:
# checking missing data for Comsumer Goods
# similarly, this industry also has less records in RMF type
data.loc[data.Industry == 'Consumer Goods'].Type.value_counts()
Out[ ]:
ETF    579
IMF    579
RMF    542
Name: Type, dtype: int64
In [ ]:
# and we figure out that Consumer Goods has missing RMF records at the beginning of entire timeline 
print('The most recent ReportDate for RMF type Consumer Goods industry is {};'.format(data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'RMF')].ReportDate.max()), '\n'
      'The least recent ReportDate for RMF type Consumer Goods industry is {}.'.format(data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'RMF')].ReportDate.min()),'\n'
      'Compare with other typed while same industry funds, their most recent ReportDate is {};'.format(data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'IMF')].ReportDate.max()), '\n'
      'and their least recent ReportDate is {}.'.format(data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'IMF')].ReportDate.min())
                                                 )
The most recent ReportDate for RMF type Consumer Goods industry is 2017-02-01 00:00:00; 
The least recent ReportDate for RMF type Consumer Goods industry is 2006-09-20 00:00:00. 
Compare with other typed while same industry funds, their most recent ReportDate is 2017-02-01 00:00:00; 
and their least recent ReportDate is 2006-01-04 00:00:00.
In [ ]:
# calculate that different in weeks for two type's start time is equals to number of missing records
# no other missing data otherwise, no discontinuity exists
week_num(data.loc[(data.Industry == 'Consumer Goods') & (data.Type == 'IMF')].ReportDate.min(), # start time
         data.loc[(data.Industry == 'Consumer Goods') & (data.Type == 'RMF')].ReportDate.min(), # end time
)
Out[ ]:
37
In [ ]:
# Consumer Goods Flow plot for 3 types:

# RMF type has a later start than the other two, and the Flow range is smaller
plt.plot(data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'RMF')].ReportDate, 
         data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'RMF')].Flow,
         label = 'RMF')
plt.title('Flow Plot for Consumer Goods in RMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# most flow amount is within -800 ~ 1000 range, while there exists extreme drop in the beginning of 2014
plt.plot(data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'IMF')].ReportDate, 
         data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'IMF')].Flow,
         label = 'IMF')
plt.title('Flow Plot for Consumer Goods in IMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# this time, the RMF type is still very similar to IMF one
plt.plot(data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'ETF')].ReportDate, 
         data.loc[(data.Industry == 'Consumer Goods') & 
                                                 (data.Type == 'ETF')].Flow,
         label = 'ETF')
plt.title('Flow Plot for Consumer Goods in ETF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()

For Industrials:

In [ ]:
# checking missing data for Industrials
# similarly, this industry also has less records in RMF type
data.loc[data.Industry == 'Industrials'].Type.value_counts()
Out[ ]:
ETF    579
IMF    579
RMF    277
Name: Type, dtype: int64
In [ ]:
# and we figure out that Industrials has missing RMF records at the beginning of entire timeline 
print('The most recent ReportDate for RMF type Industrials industry is {};'.format(data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'RMF')].ReportDate.max()), '\n'
      'The least recent ReportDate for RMF type Industrials industry is {}.'.format(data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'RMF')].ReportDate.min()),'\n'
      'Compare with other typed while same industry funds, their most recent ReportDate is {};'.format(data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'IMF')].ReportDate.max()), '\n'
      'and their least recent ReportDate is {}.'.format(data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'IMF')].ReportDate.min())
                                                 )
The most recent ReportDate for RMF type Industrials industry is 2017-02-01 00:00:00; 
The least recent ReportDate for RMF type Industrials industry is 2011-10-19 00:00:00. 
Compare with other typed while same industry funds, their most recent ReportDate is 2017-02-01 00:00:00; 
and their least recent ReportDate is 2006-01-04 00:00:00.
In [ ]:
# calculate that different in weeks for two type's start time is equals to number of missing records
# no other missing data otherwise, no discontinuity exists 
week_num(data.loc[(data.Industry == 'Industrials') & (data.Type == 'IMF')].ReportDate.min(), # start time
         data.loc[(data.Industry == 'Industrials') & (data.Type == 'RMF')].ReportDate.min(), # end time
)
Out[ ]:
301
In [ ]:
# Industrials plot for 3 types:

# RMF type has a much later start than the other two, and the Flow range is smaller
plt.plot(data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'RMF')].ReportDate, 
         data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'RMF')].Flow,
         label = 'RMF')
plt.title('Flow Plot for Industrials in RMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# most flow amount is within -500 ~ 500 range, while there exists extreme drop in 2014
# and extreme rise in 2017
plt.plot(data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'IMF')].ReportDate, 
         data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'IMF')].Flow,
         label = 'IMF')
plt.title('Flow Plot for Industrials in IMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# this time, the RMF type is not similar to IMF one,
# the trend is very unstable, very frequent rises and drops during the timeline
plt.plot(data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'ETF')].ReportDate, 
         data.loc[(data.Industry == 'Industrials') & 
                                                 (data.Type == 'ETF')].Flow,
         label = 'ETF')
plt.title('Flow Plot for Industrials in ETF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()

For Infrastructure:

In [ ]:
# checking missing data for Infrastructure
# this industry has too few records for visualization
data.loc[data.Industry == 'Infrastructure'].Type.value_counts()
Out[ ]:
RMF    43
IMF    16
ETF     2
Name: Type, dtype: int64
In [ ]:
# and we figure out that Industrials has missing RMF records at the beginning of entire timeline 
print('The most recent ReportDate for RMF type Industrials industry is {};'.format(data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'RMF')].ReportDate.max()), '\n'
      'The least recent ReportDate for RMF type Industrials industry is {}.'.format(data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'RMF')].ReportDate.min()),'\n'
      'Compare with IMF type while same industry funds, their most recent ReportDate is {};'.format(data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'IMF')].ReportDate.max()), '\n'
      'and their least recent ReportDate is {}.'.format(data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'IMF')].ReportDate.min()), '\n'
       'Compare with ETF type while same industry funds, their most recent ReportDate is {};'.format(data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'ETF')].ReportDate.max()), '\n'
      'and their least recent ReportDate is {}.'.format(data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'ETF')].ReportDate.min())
                                                 )
The most recent ReportDate for RMF type Industrials industry is 2017-02-01 00:00:00; 
The least recent ReportDate for RMF type Industrials industry is 2016-04-13 00:00:00. 
Compare with IMF type while same industry funds, their most recent ReportDate is 2017-02-01 00:00:00; 
and their least recent ReportDate is 2014-10-01 00:00:00. 
Compare with ETF type while same industry funds, their most recent ReportDate is 2014-10-08 00:00:00; 
and their least recent ReportDate is 2014-10-01 00:00:00.
In [ ]:
# Infrastructure plot for 3 types:

# RMF has less than 1 year records
plt.plot(data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'RMF')].ReportDate, 
         data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'RMF')].Flow,
         label = 'RMF')
plt.title('Flow Plot for Infrastructure in RMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()
In [ ]:
# IMF has only 4 week records
# and the data exists time-line discountinuity
plt.scatter(x=data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'IMF')].ReportDate, 
         y=data.loc[(data.Industry == 'Infrastructure') & 
                                                 (data.Type == 'IMF')].Flow,
         label = 'IMF')
plt.title('Flow Plot for Infrastructure in IMF Type')
plt.xlabel('Time Range')
plt.ylabel('Flow Range')
plt.show()

EDA Continued

Smoothing Trend

The moving average can remove the short-term fluctuations of the time series and make the data smooth, so that the trend characteristics of the series can be easily seen.

The chart below shows AssetsEnd data for 4 industry ETF/IMF/RMF stocks along with a 30-week simple moving average. Moving averages smooth out volatility, thus showing long-term volatility trends.

In [ ]:
from matplotlib.pyplot import MultipleLocator
In [ ]:
dataSMA = data[(data['Industry']=='Health Care/Biotech') & (data['Type']=='ETF')]
dataSMA.sort_index(axis=0,ascending=False,inplace=True)
dataSMA['SMA_3'] = dataSMA['AssetsEnd'].rolling(window=3).mean()
dataSMA['SMA_5'] = dataSMA['AssetsEnd'].rolling(window=5).mean()
dataSMA = dataSMA[['ReportDate','AssetsEnd','SMA_3','SMA_5']]
print(dataSMA)
      ReportDate     AssetsEnd         SMA_3         SMA_5
10988 2006-01-04   2601.925992           NaN           NaN
10969 2006-01-11   2713.500430           NaN           NaN
10950 2006-01-18   2820.538463   2711.988295           NaN
10931 2006-01-25   2706.582457   2746.873783           NaN
10912 2006-02-01   2829.093709   2785.404876   2734.328210
...          ...           ...           ...           ...
80    2017-01-04  36302.502477  35655.129299  35486.210045
61    2017-01-11  37003.864299  36239.294229  35852.612440
42    2017-01-18  36682.947283  36663.104686  36130.439896
23    2017-01-25  35612.145402  36432.985661  36202.595074
4     2017-02-01  36846.852743  36380.648476  36489.662441

[579 rows x 4 columns]
In [ ]:
dataSMA.plot(kind='line', x='ReportDate', y=['AssetsEnd','SMA_3','SMA_5'], color=['lightsteelblue','cornflowerblue','royalblue'])
plt.legend()
plt.xlabel('year')
plt.ylabel('moving average')
plt.title('Simple Moving Average of the Health Care/Biotech Industry ETF Stocks')
y_major_locator=MultipleLocator(10000)
ax=plt.gca()
ax.yaxis.set_major_locator(y_major_locator)
plt.ylim(0,80000)
plt.show()
In [ ]:
list_indus = ['Financials','Energy','Health Care/Biotech','Technology']
for i in list_indus:
  def SMA(AssetsEnd, n):
      return AssetsEnd.rolling(window=n).mean()

  def WMA(AssetsEnd, n):
      return AssetsEnd.rolling(window=n).apply(lambda x: x[::-1].cumsum().sum() * 2 / n / (n + 1))

  def EMA(AssetsEnd, n):
      return AssetsEnd.ewm(span=n, min_periods=n).mean()

  dataMA1 = data[(data['Industry']==i) & (data['Type']=='ETF')]
  dataMA1.sort_index(axis=0,ascending=False,inplace=True)

  dataMA1['SMA_30'] = SMA(dataMA1['AssetsEnd'], 30)
  dataMA1['WMA_30'] = WMA(dataMA1['AssetsEnd'], 30)
  dataMA1['EMA_30'] = EMA(dataMA1['AssetsEnd'], 30)
  dataMA1 = dataMA1[['ReportDate','AssetsEnd','SMA_30','WMA_30','EMA_30']]
  dataMA1.plot(kind='line', x='ReportDate', y=['AssetsEnd','SMA_30','WMA_30','EMA_30'], figsize=(16, 6), color=['lightsteelblue','lightblue','cornflowerblue','royalblue'])
  plt.legend()
  plt.xlabel('year')
  plt.ylabel('moving average')
  plt.title('Moving Average of the ' + i + ' Industry ETF Stocks')
  y_major_locator=MultipleLocator(10000)
  ax=plt.gca()
  ax.yaxis.set_major_locator(y_major_locator)
  plt.ylim(0,80000)
  plt.show()
In [ ]:
for i in list_indus:
  def SMA(AssetsEnd, n):
      return AssetsEnd.rolling(window=n).mean()

  def WMA(AssetsEnd, n):
      return AssetsEnd.rolling(window=n).apply(lambda x: x[::-1].cumsum().sum() * 2 / n / (n + 1))

  def EMA(AssetsEnd, n):
      return AssetsEnd.ewm(span=n, min_periods=n).mean()

  dataMA2 = data[(data['Industry']==i) & (data['Type']=='IMF')]
  dataMA2.sort_index(axis=0,ascending=False,inplace=True)

  dataMA2['SMA_30'] = SMA(dataMA2['AssetsEnd'], 30)
  dataMA2['WMA_30'] = WMA(dataMA2['AssetsEnd'], 30)
  dataMA2['EMA_30'] = EMA(dataMA2['AssetsEnd'], 30)
  dataMA2 = dataMA2[['ReportDate','AssetsEnd','SMA_30','WMA_30','EMA_30']]

  dataMA2.plot(kind='line', x='ReportDate', y=['AssetsEnd','SMA_30','WMA_30','EMA_30'], figsize=(16, 6), color=['lightsteelblue','lightblue','cornflowerblue','royalblue'])
  plt.legend()
  plt.xlabel('year')
  plt.ylabel('moving average')
  plt.title('Moving Average of the ' + i + ' Industry IMF Stocks')
  y_major_locator=MultipleLocator(10000)
  ax=plt.gca()
  ax.yaxis.set_major_locator(y_major_locator)
  plt.ylim(0,80000)
  plt.show()
In [ ]:
for i in list_indus:
  def SMA(AssetsEnd, n):
      return AssetsEnd.rolling(window=n).mean()

  def WMA(AssetsEnd, n):
      return AssetsEnd.rolling(window=n).apply(lambda x: x[::-1].cumsum().sum() * 2 / n / (n + 1))

  def EMA(AssetsEnd, n):
      return AssetsEnd.ewm(span=n, min_periods=n).mean()

  dataMA3 = data[(data['Industry']==i) & (data['Type']=='RMF')]
  dataMA3.sort_index(axis=0,ascending=False,inplace=True)

  dataMA3['SMA_30'] = SMA(dataMA3['AssetsEnd'], 30)
  dataMA3['WMA_30'] = WMA(dataMA3['AssetsEnd'], 30)
  dataMA3['EMA_30'] = EMA(dataMA3['AssetsEnd'], 30)
  dataMA3 = dataMA3[['ReportDate','AssetsEnd','SMA_30','WMA_30','EMA_30']]

  dataMA3.plot(kind='line', x='ReportDate', y=['AssetsEnd','SMA_30','WMA_30','EMA_30'], figsize=(16, 6), color=['lightsteelblue','lightblue','cornflowerblue','royalblue'])
  plt.legend()
  plt.xlabel('year')
  plt.ylabel('moving average')
  plt.title('Moving Average of the ' + i + ' Industry RMF Stocks')
  y_major_locator=MultipleLocator(10000)
  ax=plt.gca()
  ax.yaxis.set_major_locator(y_major_locator)
  plt.ylim(0,80000)
  plt.show()

Logarithmic Rate of Return

Regarding continuous compound returns, we can get the annual cumulative return by log-cumulative method.

In [ ]:
data['AssetsStart'] = data['AssetsEnd'] - data['Flow']
data.sample(3)
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry year month AssetsStart
9716 2007-01-31 3.812515 9.9697 42.582638 1.3846 RMF Commodities/Materials 2007 1 38.770122
1104 2015-12-30 1088.382234 0.3727 294068.260260 0.3338 IMF Large Cap Growth 2015 12 292979.878026
2030 2015-01-21 206.640009 0.6668 31394.211695 0.6528 ETF Technology 2015 1 31187.571686
In [ ]:
list_year = [2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017]
for i in list_indus:
  def funRMF(year):
    data_yHB = data[(data['year']==year ) & (data['Industry']==i) & (data['Type']=='ETF')]
    data_yHB.sort_index(axis=0,ascending=False,inplace=True)
#take log return
    dataHB = data_yHB.loc[:,['ReportDate','AssetsStart','AssetsEnd']]
    dataHB['log_price'] = np.log(dataHB['AssetsEnd'])
    dataHB['log_return'] = dataHB.log_price.diff()
    year_returnHB = dataHB.log_return.sum()
    return(year_returnHB)

  year_returnHB = [funRMF(i) for i in list_year]
  plt.plot(list_year, year_returnHB, color = 'cornflowerblue', label = 'year return')
  plt.legend()
  plt.xlabel('year')
  plt.ylabel('year_retunHB')
  plt.title('Year Returns for the ' + i + ' Industry ETF stocks')
  plt.axhline(y=0, color='grey', linestyle='--')
  plt.ylim(-1,2)
  y_major_locator=MultipleLocator(0.2)
  ax=plt.gca()
  ax.yaxis.set_major_locator(y_major_locator)
  plt.show()
In [ ]:
list_year = [2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017]
for i in list_indus:
  def funRMF(year):
    data_yHB = data[(data['year']==year ) & (data['Industry']==i) & (data['Type']=='IMF')]
    data_yHB.sort_index(axis=0,ascending=False,inplace=True)
#take log return
    dataHB = data_yHB.loc[:,['ReportDate','AssetsStart','AssetsEnd']]
    dataHB['log_price'] = np.log(dataHB['AssetsEnd'])
    dataHB['log_return'] = dataHB.log_price.diff()
    year_returnHB = dataHB.log_return.sum()
    return(year_returnHB)

  year_returnHB = [funRMF(i) for i in list_year]
  plt.plot(list_year, year_returnHB, color = 'cornflowerblue', label = 'year return')
  plt.legend()
  plt.xlabel('year')
  plt.ylabel('year_retunHB')
  plt.title('Year Returns for the ' + i + ' Industry IMF stocks')
  plt.axhline(y=0, color='grey', linestyle='--')
  plt.ylim(-1,2)
  y_major_locator=MultipleLocator(0.2)
  ax=plt.gca()
  ax.yaxis.set_major_locator(y_major_locator)
  plt.show()
In [ ]:
list_year = [2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017]
for i in list_indus:
  def funRMF(year):
    data_yHB = data[(data['year']==year ) & (data['Industry']==i) & (data['Type']=='RMF')]
    data_yHB.sort_index(axis=0,ascending=False,inplace=True)
#take log return
    dataHB = data_yHB.loc[:,['ReportDate','AssetsStart','AssetsEnd']]
    dataHB['log_price'] = np.log(dataHB['AssetsEnd'])
    dataHB['log_return'] = dataHB.log_price.diff()
    year_returnHB = dataHB.log_return.sum()
    return(year_returnHB)

  year_returnHB = [funRMF(i) for i in list_year]
  plt.plot(list_year, year_returnHB, color = 'cornflowerblue', label = 'year return')
  plt.legend()
  plt.xlabel('year')
  plt.ylabel('year_retunHB')
  plt.title('Year Returns for the ' + i + ' Industry RMF stocks')
  plt.axhline(y=0, color='grey', linestyle='--')
  plt.ylim(-1,2)
  y_major_locator=MultipleLocator(0.2)
  ax=plt.gca()
  ax.yaxis.set_major_locator(y_major_locator)
  plt.show()

Time Series Analysis

Local Optimization

Here we generate 3 (MA, SES & LSTM) models for each of 48 unique combination of fund type and fund industry, we aim at minimizing mse and reach a relatively high accuracy on predicting the tradable signal in terms of flow amount.

Among the four models, MA and SES estimations are simple to understand, fast to execute and return with relatively more accurate prediction results.

In [ ]:
# drop all rows where industry == Infrastructure
df = data.drop(data[data['Industry'] == 'Infrastructure'].index)
In [ ]:
data.Type.unique()
Out[ ]:
array(['ETF', 'IMF', 'RMF'], dtype=object)
In [ ]:
data.Industry.unique()
Out[ ]:
array(['Commodities/Materials', 'Consumer Goods', 'Energy', 'Financials',
       'Health Care/Biotech', 'Industrials', 'Large Cap Blend',
       'Large Cap Growth', 'Large Cap Value', 'Mid Cap Blend',
       'Mid Cap Growth', 'Mid Cap Value', 'Real Estate',
       'Small Cap Blend', 'Small Cap Growth', 'Small Cap Value',
       'Technology', 'Telecom', 'Utilities', 'Infrastructure'],
      dtype=object)
In [ ]:
# exclude industries with missing data observations
industries = ['Energy', 'Financials',
       'Health Care/Biotech', 'Large Cap Blend',
       'Large Cap Growth', 'Large Cap Value', 'Mid Cap Blend',
       'Mid Cap Growth', 'Mid Cap Value', 'Real Estate',
       'Small Cap Blend', 'Small Cap Growth', 'Small Cap Value',
       'Technology', 'Telecom', 'Utilities']
In [ ]:
len(industries)
Out[ ]:
16

Stationary Check

All subsets of flow (with different types and industries) are tested to be stastionary.

In [ ]:
#Ho: It is non-stationary
#H1: It is stationary

# self define func
def adfuller_test(flow):
    result = adfuller(flow)
    labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations']
    for value,label in zip(result,labels):
        print(label+' : '+str(value) )

    if result[1] <= 0.05:
      print("Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.")
    else:
      print("weak evidence against null hypothesis,indicating it is non-stationary.")
In [ ]:
# all sub flows are stationary
for t in data.Type.unique():
  # iterating for 16 industries 
  for i in industries:
    print('Stationary check for type {} in {} industry:'.format(t, i))
    temp = data.loc[(data.Type == t) & (data.Industry == i)][['ReportDate', 'Flow']]
    temp_test = temp['Flow']
    adfuller_test(temp_test)
    print('-'*50)
Stationary check for type ETF in Energy industry:
ADF Test Statistic : -21.07813363393672
p-value : 0.0
#Lags Used : 1
Number of Observations : 577
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Financials industry:
ADF Test Statistic : -7.130444759126436
p-value : 3.529128253702001e-10
#Lags Used : 18
Number of Observations : 560
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Health Care/Biotech industry:
ADF Test Statistic : -4.133572223520001
p-value : 0.0008519063114161399
#Lags Used : 19
Number of Observations : 559
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Large Cap Blend industry:
ADF Test Statistic : -6.0650270399111905
p-value : 1.1874428447413857e-07
#Lags Used : 12
Number of Observations : 566
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Large Cap Growth industry:
ADF Test Statistic : -5.382254970743792
p-value : 3.690344369679783e-06
#Lags Used : 18
Number of Observations : 560
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Large Cap Value industry:
ADF Test Statistic : -8.412933824440003
p-value : 2.0802029449220027e-13
#Lags Used : 4
Number of Observations : 574
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Mid Cap Blend industry:
ADF Test Statistic : -6.335875047653108
p-value : 2.8316506819471788e-08
#Lags Used : 13
Number of Observations : 565
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Mid Cap Growth industry:
ADF Test Statistic : -5.179072324016065
p-value : 9.699522814181292e-06
#Lags Used : 12
Number of Observations : 566
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Mid Cap Value industry:
ADF Test Statistic : -4.98267933979688
p-value : 2.401320100266466e-05
#Lags Used : 12
Number of Observations : 566
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Real Estate industry:
ADF Test Statistic : -24.413518892682724
p-value : 0.0
#Lags Used : 0
Number of Observations : 578
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Small Cap Blend industry:
ADF Test Statistic : -8.967765672696364
p-value : 7.903704904405242e-15
#Lags Used : 9
Number of Observations : 569
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Small Cap Growth industry:
ADF Test Statistic : -6.081262817044422
p-value : 1.0907822599242858e-07
#Lags Used : 13
Number of Observations : 565
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Small Cap Value industry:
ADF Test Statistic : -7.357481861864091
p-value : 9.69445461269432e-11
#Lags Used : 12
Number of Observations : 566
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Technology industry:
ADF Test Statistic : -21.168476415693885
p-value : 0.0
#Lags Used : 0
Number of Observations : 578
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Telecom industry:
ADF Test Statistic : -10.959103159585563
p-value : 8.414400771928266e-20
#Lags Used : 2
Number of Observations : 576
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type ETF in Utilities industry:
ADF Test Statistic : -7.085919246416656
p-value : 4.539003977765473e-10
#Lags Used : 11
Number of Observations : 567
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Energy industry:
ADF Test Statistic : -6.478094090661446
p-value : 1.3157041919648519e-08
#Lags Used : 7
Number of Observations : 571
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Financials industry:
ADF Test Statistic : -25.42824796445781
p-value : 0.0
#Lags Used : 0
Number of Observations : 578
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Health Care/Biotech industry:
ADF Test Statistic : -4.061636128459828
p-value : 0.0011200214233059646
#Lags Used : 19
Number of Observations : 559
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Large Cap Blend industry:
ADF Test Statistic : -6.250603010743232
p-value : 4.4638535763684026e-08
#Lags Used : 12
Number of Observations : 566
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Large Cap Growth industry:
ADF Test Statistic : -22.607433869735612
p-value : 0.0
#Lags Used : 0
Number of Observations : 578
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Large Cap Value industry:
ADF Test Statistic : -8.434321368006978
p-value : 1.8340936344071603e-13
#Lags Used : 4
Number of Observations : 574
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Mid Cap Blend industry:
ADF Test Statistic : -6.103291643565583
p-value : 9.718798449422412e-08
#Lags Used : 13
Number of Observations : 565
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Mid Cap Growth industry:
ADF Test Statistic : -5.697739607708634
p-value : 7.803259273879385e-07
#Lags Used : 6
Number of Observations : 572
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Mid Cap Value industry:
ADF Test Statistic : -5.885427327277903
p-value : 3.0095003186737737e-07
#Lags Used : 5
Number of Observations : 573
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Real Estate industry:
ADF Test Statistic : -14.773103088485318
p-value : 2.3061654286366015e-27
#Lags Used : 1
Number of Observations : 577
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Small Cap Blend industry:
ADF Test Statistic : -16.427798629706494
p-value : 2.4950351968641158e-29
#Lags Used : 3
Number of Observations : 575
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Small Cap Growth industry:
ADF Test Statistic : -4.4943413277609405
p-value : 0.00020132163754303192
#Lags Used : 12
Number of Observations : 566
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Small Cap Value industry:
ADF Test Statistic : -6.723171722828416
p-value : 3.4418703770205606e-09
#Lags Used : 7
Number of Observations : 571
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Technology industry:
ADF Test Statistic : -20.904684580410336
p-value : 0.0
#Lags Used : 0
Number of Observations : 578
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Telecom industry:
ADF Test Statistic : -10.959103159585563
p-value : 8.414400771928266e-20
#Lags Used : 2
Number of Observations : 576
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type IMF in Utilities industry:
ADF Test Statistic : -7.031348858855168
p-value : 6.173719656265424e-10
#Lags Used : 11
Number of Observations : 567
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Energy industry:
ADF Test Statistic : -3.851604104430066
p-value : 0.0024213366409842783
#Lags Used : 8
Number of Observations : 570
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Financials industry:
ADF Test Statistic : -6.431915984825449
p-value : 1.6891600413517873e-08
#Lags Used : 5
Number of Observations : 573
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Health Care/Biotech industry:
ADF Test Statistic : -7.785238847456829
p-value : 8.213183448623165e-12
#Lags Used : 3
Number of Observations : 575
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Large Cap Blend industry:
ADF Test Statistic : -9.11567837154558
p-value : 3.3074983955384967e-15
#Lags Used : 3
Number of Observations : 575
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Large Cap Growth industry:
ADF Test Statistic : -5.347765051781709
p-value : 4.356739127378725e-06
#Lags Used : 10
Number of Observations : 568
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Large Cap Value industry:
ADF Test Statistic : -4.412860260588984
p-value : 0.00028165741634763735
#Lags Used : 8
Number of Observations : 570
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Mid Cap Blend industry:
ADF Test Statistic : -3.8503519184249515
p-value : 0.0024321823197618786
#Lags Used : 12
Number of Observations : 566
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Mid Cap Growth industry:
ADF Test Statistic : -5.179578736659455
p-value : 9.676530750966223e-06
#Lags Used : 9
Number of Observations : 569
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Mid Cap Value industry:
ADF Test Statistic : -7.01437620536648
p-value : 6.792220920028543e-10
#Lags Used : 4
Number of Observations : 574
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Real Estate industry:
ADF Test Statistic : -4.868270722216884
p-value : 4.018833788584689e-05
#Lags Used : 16
Number of Observations : 562
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Small Cap Blend industry:
ADF Test Statistic : -5.459615955281207
p-value : 2.535860818469892e-06
#Lags Used : 5
Number of Observations : 573
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Small Cap Growth industry:
ADF Test Statistic : -3.814719426578998
p-value : 0.0027602557809922367
#Lags Used : 8
Number of Observations : 570
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Small Cap Value industry:
ADF Test Statistic : -6.0380465888912624
p-value : 1.3669728777658914e-07
#Lags Used : 4
Number of Observations : 574
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Technology industry:
ADF Test Statistic : -8.69135280699107
p-value : 4.0325918902799045e-14
#Lags Used : 3
Number of Observations : 575
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Telecom industry:
ADF Test Statistic : -15.09765588162961
p-value : 8.0244614554203345e-28
#Lags Used : 3
Number of Observations : 575
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------
Stationary check for type RMF in Utilities industry:
ADF Test Statistic : -5.9504688676283095
p-value : 2.153263010411509e-07
#Lags Used : 5
Number of Observations : 573
Strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data is stationary.
--------------------------------------------------

Moving Average

In [ ]:
from sklearn.metrics import mean_squared_error
In [ ]:
# initialize empty dataframe to store moving avgerage estimation results
df_ma = pd.DataFrame()
In [ ]:
# iterating for 3 types
for t in data.Type.unique():
  # iterating for 16 industries 
  for i in industries:
    # ma window
    for n in range(1, 21):
      sub = data.loc[(data.Type == t) & (data.Industry == i)][['ReportDate', 'Flow']]
      # no burn-in period assumed, predict the last 28 observations
      mse = mean_squared_error(sub["Flow"][550:], sub["Flow"].rolling(n).mean()[549:578])
      df_ma = df_ma.append({'Type': t,
                            'Industry': i,
                            'MA window': n,
                            'MA_mse': mse}, ignore_index = True
                           )
      # print('With window = {}, the corresponding mse in {} Industry with {} Type is {}.'.format(n, i, t, mse))
In [ ]:
# groupby mse values by Type and Industries, and sort values by mse
df_ma = df_ma.groupby(['Type','Industry']).apply(lambda x: x.sort_values(["MA_mse"], ascending = True)).reset_index(drop=True)
In [ ]:
# store best result
df_ma_final = pd.DataFrame()

for i in range(0, 960, 20):
  df_ma_final = df_ma_final.append(df_ma.iloc[i])
In [ ]:
# reset index
df_ma_final.reset_index(inplace=True)
# remove first column
df_ma_final = df_ma_final.iloc[:, 1:]
# ma window as int
df_ma_final['MA window'] = df_ma_final['MA window'].astype('int')

df_ma_final
Out[ ]:
Type Industry MA window MA_mse
0 ETF Energy 13 1.683463e+05
1 ETF Financials 11 5.298055e+04
2 ETF Health Care/Biotech 14 1.037090e+04
3 ETF Large Cap Blend 20 3.255064e+06
4 ETF Large Cap Growth 20 1.427495e+05
5 ETF Large Cap Value 17 1.565841e+05
6 ETF Mid Cap Blend 20 3.422931e+04
7 ETF Mid Cap Growth 6 1.773055e+03
8 ETF Mid Cap Value 9 2.273634e+04
9 ETF Real Estate 18 4.950357e+04
10 ETF Small Cap Blend 19 1.068476e+06
11 ETF Small Cap Growth 20 6.899885e+03
12 ETF Small Cap Value 8 2.043102e+04
13 ETF Technology 7 1.866661e+03
14 ETF Telecom 16 6.866296e+02
15 ETF Utilities 15 8.215396e+03
16 IMF Energy 13 1.683224e+05
17 IMF Financials 11 5.297828e+04
18 IMF Health Care/Biotech 14 1.045610e+04
19 IMF Large Cap Blend 18 3.307342e+06
20 IMF Large Cap Growth 20 2.041235e+05
21 IMF Large Cap Value 17 1.566027e+05
22 IMF Mid Cap Blend 20 3.241088e+04
23 IMF Mid Cap Growth 10 1.419618e+04
24 IMF Mid Cap Value 9 2.302436e+04
25 IMF Real Estate 14 5.037292e+04
26 IMF Small Cap Blend 19 1.110367e+06
27 IMF Small Cap Growth 17 9.586868e+03
28 IMF Small Cap Value 8 2.132454e+04
29 IMF Technology 4 2.795837e+04
30 IMF Telecom 16 6.866296e+02
31 IMF Utilities 15 8.316653e+03
32 RMF Energy 2 1.613458e+02
33 RMF Financials 4 3.215849e+00
34 RMF Health Care/Biotech 2 3.452898e+02
35 RMF Large Cap Blend 18 1.380561e+06
36 RMF Large Cap Growth 20 1.528636e+06
37 RMF Large Cap Value 13 6.658096e+04
38 RMF Mid Cap Blend 15 2.717574e+03
39 RMF Mid Cap Growth 20 7.562105e+04
40 RMF Mid Cap Value 5 7.790957e+02
41 RMF Real Estate 8 6.181676e+01
42 RMF Small Cap Blend 4 9.949174e+03
43 RMF Small Cap Growth 15 6.799021e+03
44 RMF Small Cap Value 19 3.889656e+03
45 RMF Technology 13 2.019848e+02
46 RMF Telecom 6 1.619449e-01
47 RMF Utilities 4 7.684388e+01
In [ ]:
# iterating for flow & MA window saved in above dataframe
for i in range(df_ma_final.shape[0]):
  # subset 
  sub = data.loc[(data.Type == df_ma_final.iloc[i].Type) & (data.Industry == df_ma_final.iloc[i].Industry)][['ReportDate', 'Flow']]
  # reorder sequence
  sub.sort_values('ReportDate', ascending  = True, inplace=True)
  # train & test split
  train = sub.iloc[:551]
  test = sub.iloc[550:]
  # predict with best ma window
  pred_ma = test.Flow.rolling(df_ma_final.iloc[i]['MA window'], min_periods = 1).mean()
  # plot
  plt.figure(figsize=(40,3))
  plt.plot(train.ReportDate, train.Flow, c = 'b', label = 'Train')
  plt.plot(test.ReportDate, test.Flow, c = 'orange', label = 'Test')
  ma_window = "MA window" + str(df_ma_final.iloc[i]['MA window'])
  plt.plot(test.ReportDate, pred_ma, c = 'grey', label = ma_window)
  plt.title('Moving Average with Window = {} for {} of {} Type'.format(df_ma_final.iloc[i]['MA window'], df_ma_final.iloc[i]['Industry'], df_ma_final.iloc[i]['Type']))
  plt.xlabel('Report Date')
  plt.ylabel('Flow')
  plt.legend()
  plt.show()
In [ ]:
# if needed to store output
# df_ma_final.to_csv('moving_avg_output.csv')

Simple Exponential Smoothing

In [ ]:
df_ses = pd.DataFrame()
In [ ]:
# iterating for 3 types
for t in data.Type.unique():
  # iterating for 16 industries 
  for i in industries:
    # ma window
    sub = data.loc[(data.Type == t) & (data.Industry == i)][['ReportDate', 'Flow']]
    for alpha in np.arange(0.1, 1.1, 0.1):
      # calculating mse
      mse = mean_squared_error(sub.Flow[550:],ExponentialSmoothing(sub.Flow).fit(smoothing_level = alpha).predict(550,))

      df_ses = df_ses.append({'Type': t,
                            'Industry': i,
                            'SES alpha': alpha,
                            'SES mse': mse}, ignore_index = True
                           )
In [ ]:
df_ses
Out[ ]:
Type Industry SES alpha SES mse
0 ETF Energy 0.1 181055.909260
1 ETF Energy 0.2 196442.184190
2 ETF Energy 0.3 211393.780202
3 ETF Energy 0.4 225551.850059
4 ETF Energy 0.5 238887.927022
... ... ... ... ...
475 RMF Utilities 0.6 78.594105
476 RMF Utilities 0.7 82.907386
477 RMF Utilities 0.8 88.519743
478 RMF Utilities 0.9 95.753710
479 RMF Utilities 1.0 105.056896

480 rows × 4 columns

In [ ]:
# groupby mse values by Type and Industries, and sort values by mse
df_ses = df_ses.groupby(['Type','Industry']).apply(lambda x: x.sort_values(["SES mse"], ascending = True)).reset_index(drop=True)
In [ ]:
# store best result
df_ses_final = pd.DataFrame()

for i in range(0, 480, 10):
  df_ses_final = df_ses_final.append(df_ses.iloc[i])
In [ ]:
df_ses_final.reset_index(inplace=True)
df_ses_final = df_ses_final.iloc[:, 1:]
df_ses_final
Out[ ]:
Type Industry SES alpha SES mse
0 ETF Energy 0.1 1.810559e+05
1 ETF Financials 0.1 5.545442e+04
2 ETF Health Care/Biotech 0.1 1.063216e+04
3 ETF Large Cap Blend 0.1 3.374319e+06
4 ETF Large Cap Growth 0.1 1.492183e+05
5 ETF Large Cap Value 0.1 1.608566e+05
6 ETF Mid Cap Blend 0.1 3.649100e+04
7 ETF Mid Cap Growth 0.2 1.829638e+03
8 ETF Mid Cap Value 0.1 2.339144e+04
9 ETF Real Estate 0.1 5.241447e+04
10 ETF Small Cap Blend 0.1 1.144424e+06
11 ETF Small Cap Growth 0.1 7.163444e+03
12 ETF Small Cap Value 0.1 2.144024e+04
13 ETF Technology 0.3 1.968853e+03
14 ETF Telecom 0.1 7.368029e+02
15 ETF Utilities 0.1 8.744426e+03
16 IMF Energy 0.1 1.810334e+05
17 IMF Financials 0.1 5.545295e+04
18 IMF Health Care/Biotech 0.1 1.069554e+04
19 IMF Large Cap Blend 0.1 3.441461e+06
20 IMF Large Cap Growth 0.1 2.163060e+05
21 IMF Large Cap Value 0.1 1.631840e+05
22 IMF Mid Cap Blend 0.1 3.417704e+04
23 IMF Mid Cap Growth 0.1 1.430911e+04
24 IMF Mid Cap Value 0.1 2.381932e+04
25 IMF Real Estate 0.1 5.334851e+04
26 IMF Small Cap Blend 0.1 1.191332e+06
27 IMF Small Cap Growth 0.1 9.798641e+03
28 IMF Small Cap Value 0.1 2.232227e+04
29 IMF Technology 0.1 2.830879e+04
30 IMF Telecom 0.1 7.368029e+02
31 IMF Utilities 0.1 8.790759e+03
32 RMF Energy 0.6 1.516567e+02
33 RMF Financials 0.4 3.168411e+00
34 RMF Health Care/Biotech 0.6 3.171519e+02
35 RMF Large Cap Blend 0.1 1.443019e+06
36 RMF Large Cap Growth 0.1 1.574836e+06
37 RMF Large Cap Value 0.1 7.028869e+04
38 RMF Mid Cap Blend 0.1 2.776529e+03
39 RMF Mid Cap Growth 0.1 7.665091e+04
40 RMF Mid Cap Value 0.2 8.191287e+02
41 RMF Real Estate 0.3 6.632460e+01
42 RMF Small Cap Blend 0.2 1.015928e+04
43 RMF Small Cap Growth 0.1 6.947379e+03
44 RMF Small Cap Value 0.1 3.937471e+03
45 RMF Technology 0.1 2.055814e+02
46 RMF Telecom 0.1 1.530775e-01
47 RMF Utilities 0.3 6.982098e+01
In [ ]:
# iterating for flow & SES smoothing factor saved in above dataframe
for i in range(df_ses_final.shape[0]):
  # subset 
  sub = data.loc[(data.Type == df_ses_final.iloc[i].Type) & (data.Industry == df_ses_final.iloc[i].Industry)][['ReportDate', 'Flow']]
  # reorder sequence
  sub.sort_values('ReportDate', ascending  = True, inplace=True)
  # train & test split
  train = sub.iloc[:551]
  test = sub.iloc[550:]
  # predict with best ses alpha
  pred_ses = ExponentialSmoothing(sub.Flow).fit(smoothing_level = df_ses_final.iloc[i]['SES alpha']).predict(550,)
  # plot
  plt.figure(figsize=(40,3))
  plt.plot(train.ReportDate, train.Flow, c = 'b', label = 'Train')
  plt.plot(test.ReportDate, test.Flow, c = 'orange', label = 'Test')
  ses_alpha = "SES smoothing factor" + str(df_ses_final.iloc[i]['SES alpha'])
  plt.plot(test.ReportDate, pred_ses, c = 'grey', label = ses_alpha)
  plt.title('Simple Exponential Smoothing with alpha = {} for {} of {} Type'.format(df_ses_final.iloc[i]['SES alpha'], df_ses_final.iloc[i]['Industry'], df_ses_final.iloc[i]['Type']))
  plt.xlabel('Report Date')
  plt.ylabel('Flow')
  plt.legend()
  plt.show()
In [ ]:
# if needed to store SES result
#df_ses_final.to_csv('ses_output.csv')

Assess MA vs SES

In [ ]:
# compare two model result

comparision = df_ma_final.copy()
comparision[['SES alpha', 'SES_mse']] = df_ses_final[['SES alpha', 'SES mse']]
# comparision.drop(columns='SES mse', inplace=True)
comparision
Out[ ]:
Type Industry MA window MA_mse SES alpha SES_mse
0 ETF Energy 13 1.683463e+05 0.1 1.810559e+05
1 ETF Financials 11 5.298055e+04 0.1 5.545442e+04
2 ETF Health Care/Biotech 14 1.037090e+04 0.1 1.063216e+04
3 ETF Large Cap Blend 20 3.255064e+06 0.1 3.374319e+06
4 ETF Large Cap Growth 20 1.427495e+05 0.1 1.492183e+05
5 ETF Large Cap Value 17 1.565841e+05 0.1 1.608566e+05
6 ETF Mid Cap Blend 20 3.422931e+04 0.1 3.649100e+04
7 ETF Mid Cap Growth 6 1.773055e+03 0.2 1.829638e+03
8 ETF Mid Cap Value 9 2.273634e+04 0.1 2.339144e+04
9 ETF Real Estate 18 4.950357e+04 0.1 5.241447e+04
10 ETF Small Cap Blend 19 1.068476e+06 0.1 1.144424e+06
11 ETF Small Cap Growth 20 6.899885e+03 0.1 7.163444e+03
12 ETF Small Cap Value 8 2.043102e+04 0.1 2.144024e+04
13 ETF Technology 7 1.866661e+03 0.3 1.968853e+03
14 ETF Telecom 16 6.866296e+02 0.1 7.368029e+02
15 ETF Utilities 15 8.215396e+03 0.1 8.744426e+03
16 IMF Energy 13 1.683224e+05 0.1 1.810334e+05
17 IMF Financials 11 5.297828e+04 0.1 5.545295e+04
18 IMF Health Care/Biotech 14 1.045610e+04 0.1 1.069554e+04
19 IMF Large Cap Blend 18 3.307342e+06 0.1 3.441461e+06
20 IMF Large Cap Growth 20 2.041235e+05 0.1 2.163060e+05
21 IMF Large Cap Value 17 1.566027e+05 0.1 1.631840e+05
22 IMF Mid Cap Blend 20 3.241088e+04 0.1 3.417704e+04
23 IMF Mid Cap Growth 10 1.419618e+04 0.1 1.430911e+04
24 IMF Mid Cap Value 9 2.302436e+04 0.1 2.381932e+04
25 IMF Real Estate 14 5.037292e+04 0.1 5.334851e+04
26 IMF Small Cap Blend 19 1.110367e+06 0.1 1.191332e+06
27 IMF Small Cap Growth 17 9.586868e+03 0.1 9.798641e+03
28 IMF Small Cap Value 8 2.132454e+04 0.1 2.232227e+04
29 IMF Technology 4 2.795837e+04 0.1 2.830879e+04
30 IMF Telecom 16 6.866296e+02 0.1 7.368029e+02
31 IMF Utilities 15 8.316653e+03 0.1 8.790759e+03
32 RMF Energy 2 1.613458e+02 0.6 1.516567e+02
33 RMF Financials 4 3.215849e+00 0.4 3.168411e+00
34 RMF Health Care/Biotech 2 3.452898e+02 0.6 3.171519e+02
35 RMF Large Cap Blend 18 1.380561e+06 0.1 1.443019e+06
36 RMF Large Cap Growth 20 1.528636e+06 0.1 1.574836e+06
37 RMF Large Cap Value 13 6.658096e+04 0.1 7.028869e+04
38 RMF Mid Cap Blend 15 2.717574e+03 0.1 2.776529e+03
39 RMF Mid Cap Growth 20 7.562105e+04 0.1 7.665091e+04
40 RMF Mid Cap Value 5 7.790957e+02 0.2 8.191287e+02
41 RMF Real Estate 8 6.181676e+01 0.3 6.632460e+01
42 RMF Small Cap Blend 4 9.949174e+03 0.2 1.015928e+04
43 RMF Small Cap Growth 15 6.799021e+03 0.1 6.947379e+03
44 RMF Small Cap Value 19 3.889656e+03 0.1 3.937471e+03
45 RMF Technology 13 2.019848e+02 0.1 2.055814e+02
46 RMF Telecom 6 1.619449e-01 0.1 1.530775e-01
47 RMF Utilities 4 7.684388e+01 0.3 6.982098e+01
In [ ]:
comparision['better model'] = np.where(comparision['MA_mse'] > comparision['SES_mse'], 'SES', 'MA')
comparision
Out[ ]:
Type Industry MA window MA_mse SES alpha SES_mse better model
0 ETF Energy 13 1.683463e+05 0.1 1.810559e+05 MA
1 ETF Financials 11 5.298055e+04 0.1 5.545442e+04 MA
2 ETF Health Care/Biotech 14 1.037090e+04 0.1 1.063216e+04 MA
3 ETF Large Cap Blend 20 3.255064e+06 0.1 3.374319e+06 MA
4 ETF Large Cap Growth 20 1.427495e+05 0.1 1.492183e+05 MA
5 ETF Large Cap Value 17 1.565841e+05 0.1 1.608566e+05 MA
6 ETF Mid Cap Blend 20 3.422931e+04 0.1 3.649100e+04 MA
7 ETF Mid Cap Growth 6 1.773055e+03 0.2 1.829638e+03 MA
8 ETF Mid Cap Value 9 2.273634e+04 0.1 2.339144e+04 MA
9 ETF Real Estate 18 4.950357e+04 0.1 5.241447e+04 MA
10 ETF Small Cap Blend 19 1.068476e+06 0.1 1.144424e+06 MA
11 ETF Small Cap Growth 20 6.899885e+03 0.1 7.163444e+03 MA
12 ETF Small Cap Value 8 2.043102e+04 0.1 2.144024e+04 MA
13 ETF Technology 7 1.866661e+03 0.3 1.968853e+03 MA
14 ETF Telecom 16 6.866296e+02 0.1 7.368029e+02 MA
15 ETF Utilities 15 8.215396e+03 0.1 8.744426e+03 MA
16 IMF Energy 13 1.683224e+05 0.1 1.810334e+05 MA
17 IMF Financials 11 5.297828e+04 0.1 5.545295e+04 MA
18 IMF Health Care/Biotech 14 1.045610e+04 0.1 1.069554e+04 MA
19 IMF Large Cap Blend 18 3.307342e+06 0.1 3.441461e+06 MA
20 IMF Large Cap Growth 20 2.041235e+05 0.1 2.163060e+05 MA
21 IMF Large Cap Value 17 1.566027e+05 0.1 1.631840e+05 MA
22 IMF Mid Cap Blend 20 3.241088e+04 0.1 3.417704e+04 MA
23 IMF Mid Cap Growth 10 1.419618e+04 0.1 1.430911e+04 MA
24 IMF Mid Cap Value 9 2.302436e+04 0.1 2.381932e+04 MA
25 IMF Real Estate 14 5.037292e+04 0.1 5.334851e+04 MA
26 IMF Small Cap Blend 19 1.110367e+06 0.1 1.191332e+06 MA
27 IMF Small Cap Growth 17 9.586868e+03 0.1 9.798641e+03 MA
28 IMF Small Cap Value 8 2.132454e+04 0.1 2.232227e+04 MA
29 IMF Technology 4 2.795837e+04 0.1 2.830879e+04 MA
30 IMF Telecom 16 6.866296e+02 0.1 7.368029e+02 MA
31 IMF Utilities 15 8.316653e+03 0.1 8.790759e+03 MA
32 RMF Energy 2 1.613458e+02 0.6 1.516567e+02 SES
33 RMF Financials 4 3.215849e+00 0.4 3.168411e+00 SES
34 RMF Health Care/Biotech 2 3.452898e+02 0.6 3.171519e+02 SES
35 RMF Large Cap Blend 18 1.380561e+06 0.1 1.443019e+06 MA
36 RMF Large Cap Growth 20 1.528636e+06 0.1 1.574836e+06 MA
37 RMF Large Cap Value 13 6.658096e+04 0.1 7.028869e+04 MA
38 RMF Mid Cap Blend 15 2.717574e+03 0.1 2.776529e+03 MA
39 RMF Mid Cap Growth 20 7.562105e+04 0.1 7.665091e+04 MA
40 RMF Mid Cap Value 5 7.790957e+02 0.2 8.191287e+02 MA
41 RMF Real Estate 8 6.181676e+01 0.3 6.632460e+01 MA
42 RMF Small Cap Blend 4 9.949174e+03 0.2 1.015928e+04 MA
43 RMF Small Cap Growth 15 6.799021e+03 0.1 6.947379e+03 MA
44 RMF Small Cap Value 19 3.889656e+03 0.1 3.937471e+03 MA
45 RMF Technology 13 2.019848e+02 0.1 2.055814e+02 MA
46 RMF Telecom 6 1.619449e-01 0.1 1.530775e-01 SES
47 RMF Utilities 4 7.684388e+01 0.3 6.982098e+01 SES
In [ ]:
# overall MA models generate a better performance in predicting short-term flow change
comparision['better model'].value_counts()
Out[ ]:
MA     43
SES     5
Name: better model, dtype: int64
In [ ]:
# plot
comparision['better model'].value_counts().plot(kind = 'bar')
plt.title('Assess MA* vs. SES*')
plt.show()

ARMA Estimations

Autocorrelation & Partial Autocorrelation plots

In [ ]:
# acf & pacf plot  

# iterating for 3 types
for t in data.Type.unique():
  # iterating for 16 industries 
  for i in industries:
    print('With fund flow in {} Industry & {} Type:'.format(i, t))
    sub = data.loc[(data.Type == t) & (data.Industry == i)][['ReportDate', 'Flow']]
    fig, ax = plt.subplots(2, figsize=(12,6))
    ax[0] = plot_acf(sub.Flow, ax=ax[0], lags=20)
    ax[1] = plot_pacf(sub.Flow, ax=ax[1], lags=20)
    plt.xlabel('Lag')
    plt.show()
    print('-'*50)
With fund flow in Energy Industry & ETF Type:
--------------------------------------------------
With fund flow in Financials Industry & ETF Type:
--------------------------------------------------
With fund flow in Health Care/Biotech Industry & ETF Type:
--------------------------------------------------
With fund flow in Large Cap Blend Industry & ETF Type:
--------------------------------------------------
With fund flow in Large Cap Growth Industry & ETF Type:
--------------------------------------------------
With fund flow in Large Cap Value Industry & ETF Type:
--------------------------------------------------
With fund flow in Mid Cap Blend Industry & ETF Type:
--------------------------------------------------
With fund flow in Mid Cap Growth Industry & ETF Type:
--------------------------------------------------
With fund flow in Mid Cap Value Industry & ETF Type:
--------------------------------------------------
With fund flow in Real Estate Industry & ETF Type:
--------------------------------------------------
With fund flow in Small Cap Blend Industry & ETF Type:
--------------------------------------------------
With fund flow in Small Cap Growth Industry & ETF Type:
--------------------------------------------------
With fund flow in Small Cap Value Industry & ETF Type:
--------------------------------------------------
With fund flow in Technology Industry & ETF Type:
--------------------------------------------------
With fund flow in Telecom Industry & ETF Type:
--------------------------------------------------
With fund flow in Utilities Industry & ETF Type:
--------------------------------------------------
With fund flow in Energy Industry & IMF Type:
--------------------------------------------------
With fund flow in Financials Industry & IMF Type:
--------------------------------------------------
With fund flow in Health Care/Biotech Industry & IMF Type:
--------------------------------------------------
With fund flow in Large Cap Blend Industry & IMF Type:
--------------------------------------------------
With fund flow in Large Cap Growth Industry & IMF Type:
--------------------------------------------------
With fund flow in Large Cap Value Industry & IMF Type:
--------------------------------------------------
With fund flow in Mid Cap Blend Industry & IMF Type:
--------------------------------------------------
With fund flow in Mid Cap Growth Industry & IMF Type:
--------------------------------------------------
With fund flow in Mid Cap Value Industry & IMF Type:
--------------------------------------------------
With fund flow in Real Estate Industry & IMF Type:
--------------------------------------------------
With fund flow in Small Cap Blend Industry & IMF Type:
--------------------------------------------------
With fund flow in Small Cap Growth Industry & IMF Type:
--------------------------------------------------
With fund flow in Small Cap Value Industry & IMF Type:
--------------------------------------------------
With fund flow in Technology Industry & IMF Type:
--------------------------------------------------
With fund flow in Telecom Industry & IMF Type:
--------------------------------------------------
With fund flow in Utilities Industry & IMF Type:
--------------------------------------------------
With fund flow in Energy Industry & RMF Type:
--------------------------------------------------
With fund flow in Financials Industry & RMF Type:
--------------------------------------------------
With fund flow in Health Care/Biotech Industry & RMF Type:
--------------------------------------------------
With fund flow in Large Cap Blend Industry & RMF Type:
--------------------------------------------------
With fund flow in Large Cap Growth Industry & RMF Type:
--------------------------------------------------
With fund flow in Large Cap Value Industry & RMF Type:
--------------------------------------------------
With fund flow in Mid Cap Blend Industry & RMF Type:
--------------------------------------------------
With fund flow in Mid Cap Growth Industry & RMF Type:
--------------------------------------------------
With fund flow in Mid Cap Value Industry & RMF Type:
--------------------------------------------------
With fund flow in Real Estate Industry & RMF Type:
--------------------------------------------------
With fund flow in Small Cap Blend Industry & RMF Type:
--------------------------------------------------
With fund flow in Small Cap Growth Industry & RMF Type:
--------------------------------------------------
With fund flow in Small Cap Value Industry & RMF Type:
--------------------------------------------------
With fund flow in Technology Industry & RMF Type:
--------------------------------------------------
With fund flow in Telecom Industry & RMF Type:
--------------------------------------------------
With fund flow in Utilities Industry & RMF Type:
--------------------------------------------------

ARMA order

In [ ]:
df_arma = pd.DataFrame()
In [ ]:
## due to stats lib updates, ARMA is not available
from statsmodels.tsa.arima_model import ARIMA
In [ ]:
# iterating for 3 types
for t in data.Type.unique():
  # iterating for 16 industries 
  for i in industries:
    # subset
    sub = data.loc[(data.Type == t) & (data.Industry == i)][['ReportDate', 'Flow']]

    # train & test split
    train = sub.Flow[:551]
    test = sub.Flow[550:]

    # ar param
    for p in range(4):
      # ma param
      for q in range(1, 4):
        # in case model order not applicable
        try:
          # initialize model
          model = ARIMA(train, order = (p, 0, q))
          # fit
          model_fit = model.fit()
          # model performance = aic + bic, the smaller the better
          res = model_fit.aic + model_fit.bic

          df_arma = df_arma.append({'Type': t,
                                'Industry': i,
                                'order': (p, q),
                                'performance': res}, ignore_index = True)
        except:
          continue
        # print('With window = {}, the corresponding mse in {} Industry with {} Type is {}.'.format(n, i, t, mse))
In [ ]:
# groupby mse values by Type and Industries, and sort values by mse
df_arma = df_arma.groupby(['Type','Industry']).apply(lambda x: x.sort_values(["performance"], ascending = True)).reset_index(drop=True)
In [ ]:
# store best result
df_arma_final = pd.DataFrame()
In [ ]:
# iterating for 3 types
for t in data.Type.unique():
  # iterating for 16 industries 
  for i in industries:
    # subset
    best = df_arma.loc[(df_arma.Type == t) & (df_arma.Industry == i)].iloc[0]
    df_arma_final = df_arma_final.append(best)
In [ ]:
df_arma_final
Out[ ]:
Type Industry order performance
0 ETF Energy (0, 1) 16550.632041
11 ETF Financials (3, 2) 17244.091640
19 ETF Health Care/Biotech (1, 2) 15711.986010
28 ETF Large Cap Blend (0, 1) 21365.890938
36 ETF Large Cap Growth (2, 1) 17625.097701
43 ETF Large Cap Value (2, 1) 16794.606375
50 ETF Mid Cap Blend (2, 2) 17088.323170
58 ETF Mid Cap Growth (2, 1) 13058.888455
67 ETF Mid Cap Value (2, 1) 14104.082401
73 ETF Real Estate (0, 1) 15683.259062
79 ETF Small Cap Blend (0, 1) 18774.066850
91 ETF Small Cap Growth (2, 2) 13946.573468
102 ETF Small Cap Value (2, 1) 14119.076649
111 ETF Technology (0, 1) 15280.685732
120 ETF Telecom (0, 3) 10513.715094
129 ETF Utilities (1, 1) 14975.303732
138 IMF Energy (2, 2) 16560.948516
147 IMF Financials (3, 2) 17246.959287
155 IMF Health Care/Biotech (1, 2) 15739.296400
164 IMF Large Cap Blend (0, 1) 21371.404372
173 IMF Large Cap Growth (1, 1) 17846.702260
183 IMF Large Cap Value (2, 1) 17088.789756
189 IMF Mid Cap Blend (3, 1) 17184.436913
195 IMF Mid Cap Growth (1, 1) 14120.121891
207 IMF Mid Cap Value (1, 1) 14442.720601
219 IMF Real Estate (0, 2) 15755.954788
225 IMF Small Cap Blend (0, 1) 18790.276398
237 IMF Small Cap Growth (3, 1) 14421.494447
244 IMF Small Cap Value (2, 1) 14259.032509
250 IMF Technology (0, 1) 15289.910175
259 IMF Telecom (0, 3) 10513.715094
268 IMF Utilities (1, 1) 14995.926800
277 RMF Energy (1, 1) 12691.630175
288 RMF Financials (1, 2) 10451.833203
300 RMF Health Care/Biotech (1, 1) 11382.968332
309 RMF Large Cap Blend (1, 1) 16371.592114
321 RMF Large Cap Growth (2, 1) 16316.098258
327 RMF Large Cap Value (1, 1) 16436.333470
337 RMF Mid Cap Blend (1, 1) 13896.729295
345 RMF Mid Cap Growth (2, 1) 15036.822414
353 RMF Mid Cap Value (1, 1) 13079.279648
365 RMF Real Estate (3, 3) 12490.866578
377 RMF Small Cap Blend (1, 2) 14057.396730
388 RMF Small Cap Growth (1, 1) 13838.665384
400 RMF Small Cap Value (1, 2) 12748.170851
412 RMF Technology (1, 1) 11302.728708
422 RMF Telecom (1, 1) 6344.266519
432 RMF Utilities (1, 1) 11092.744037
In [ ]:
# visualizing
for row in range(df_arma_final.shape[0]):
  # extract column factors
  industry = df_arma_final.iloc[row].Industry
  ftype = df_arma_final.iloc[row].Type
  order = list(df_arma_final.iloc[row].order)

  # subset data
  sub = data.loc[(data.Type == ftype) & (data.Industry == industry)][['ReportDate', 'Flow']]
  # reorder sequence
  sub.sort_values('ReportDate', ascending  = True, inplace=True)
  
  # train & test split
  train = sub[:551]
  test = sub[550:]

  # initialize 
  model = ARIMA(train.Flow, order = (order[0], 0, order[1]))
  # fit
  try:
    model_fit = model.fit()
    # predict
    predict = model_fit.predict(start = 550, end = 578)

    # plot
    plt.figure(figsize= (40, 3))
    plt.plot(train.ReportDate, train.Flow, c = 'b', label = 'Train')
    plt.plot(test.ReportDate, test.Flow, c='orange', label = 'Test')
    plt.plot(test.ReportDate, predict, c='grey', label = order)
    plt.title('ARMA with order = {} for {} of {} Type'.format(order, industry, ftype))
    plt.xlabel('Report Date')
    plt.ylabel('Flow')
    plt.legend()
    plt.show()

  except:
    continue
In [ ]:
# print one subset mse for presentation comparision
# subset: ETF Energy
temp = data.loc[(data.Type == 'ETF') & (data.Industry == 'Energy')][['ReportDate', 'Flow']]
t_train = temp[:551]
t_test = temp[550:]

model = ARIMA(t_train.Flow, order = (0,0,1))

predict = model_fit.predict(start = 550, end = 578)

mean_squared_error(t_test.Flow, predict)
Out[ ]:
166961.1871359826

LSTM Estimatioms

In [ ]:
LSTM_ETF=data[data['Type']== 'ETF']
In [ ]:
LSTM_ETF.head(3)
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry year month AssetsStart
0 2017-02-01 378.578706 4.5064 8679.056347 -1.1938 ETF Commodities/Materials 2017 2 8300.477641
1 2017-02-01 332.526792 1.1479 28973.613065 -1.1231 ETF Consumer Goods 2017 2 28641.086272
2 2017-02-01 186.031374 0.3782 48446.700077 -1.8855 ETF Energy 2017 2 48260.668703
In [ ]:
LSTM_ETF_energy=LSTM_ETF[LSTM_ETF['Industry']=='Energy']
In [ ]:
LSTM_ETF_energy.head(3)
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry year month AssetsStart
2 2017-02-01 186.031374 0.3782 48446.700077 -1.8855 ETF Energy 2017 2 48260.668703
21 2017-01-25 242.309208 0.5019 49187.639104 1.3951 ETF Energy 2017 1 48945.329896
40 2017-01-18 237.672693 0.4934 48270.885205 -0.2861 ETF Energy 2017 1 48033.212512
In [ ]:
LSTM_ETF_energy = LSTM_ETF_energy.sort_values(by='ReportDate', ascending=True)
In [ ]:
LSTM_ETF_energy.head(3)
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry year month AssetsStart
10986 2006-01-04 12.115391 0.7262 1756.771759 4.5757 ETF Energy 2006 1 1744.656368
10967 2006-01-11 16.227106 0.9236 1808.084925 1.9971 ETF Energy 2006 1 1791.857818
10948 2006-01-18 20.685026 1.1440 1846.248449 0.9666 ETF Energy 2006 1 1825.563423
In [ ]:
input_data=LSTM_ETF_energy[['FlowPct','AssetsEnd','PortfolioChangePct','Flow']]
In [ ]:
input_data=input_data.reset_index(drop=True)
In [ ]:
input_data.head(3)
Out[ ]:
FlowPct AssetsEnd PortfolioChangePct Flow
0 0.7262 1756.771759 4.5757 12.115391
1 0.9236 1808.084925 1.9971 16.227106
2 1.1440 1846.248449 0.9666 20.685026
In [ ]:
X = input_data.iloc[:, :-1]
y = input_data.iloc[:, 3:4]
In [ ]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
mm = MinMaxScaler()
ss = StandardScaler()


X_ss = ss.fit_transform(X)
y_mm = mm.fit_transform(y)
In [ ]:
#first 550 for training

X_train = X_ss[:550, :]
X_test = X_ss[550:, :]

y_train = y_mm[:550, :]
y_test = y_mm[550:, :]
In [ ]:
print("Training Shape", X_train.shape, y_train.shape)
print("Testing Shape", X_test.shape, y_test.shape)
Training Shape (550, 3) (550, 1)
Testing Shape (29, 3) (29, 1)
In [ ]:
X_train_tensors = Variable(torch.Tensor(X_train))
X_test_tensors = Variable(torch.Tensor(X_test))

y_train_tensors = Variable(torch.Tensor(y_train))
y_test_tensors = Variable(torch.Tensor(y_test))
In [ ]:
#reshaping to rows, timestamps, features

X_train_tensors_final = torch.reshape(X_train_tensors,   (X_train_tensors.shape[0], 1, X_train_tensors.shape[1]))


X_test_tensors_final = torch.reshape(X_test_tensors,  (X_test_tensors.shape[0], 1, X_test_tensors.shape[1]))
In [ ]:
print("Training Shape", X_train_tensors_final.shape, y_train_tensors.shape)
print("Testing Shape", X_test_tensors_final.shape, y_test_tensors.shape)
Training Shape torch.Size([550, 1, 3]) torch.Size([550, 1])
Testing Shape torch.Size([29, 1, 3]) torch.Size([29, 1])
In [ ]:
class LSTM1(nn.Module):
    def __init__(self, num_classes, input_size, hidden_size, num_layers, seq_length):
        super(LSTM1, self).__init__()
        self.num_classes = num_classes #number of classes
        self.num_layers = num_layers #number of layers
        self.input_size = input_size #input size
        self.hidden_size = hidden_size #hidden state
        self.seq_length = seq_length #sequence length

        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True) #lstm
        self.fc_1 =  nn.Linear(hidden_size, 128) #fully connected 1
        self.fc = nn.Linear(128, num_classes) #fully connected last layer

        self.relu = nn.ReLU()
    
    def forward(self,x):
        h_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size)) #hidden state
        c_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size)) #internal state
        # Propagate input through LSTM
        output, (hn, cn) = self.lstm(x, (h_0, c_0)) #lstm with input, hidden, and internal state
        hn = hn.view(-1, self.hidden_size) #reshaping the data for Dense layer next
        out = self.relu(hn)
        out = self.fc_1(out) #first Dense
        out = self.relu(out) #relu
        out = self.fc(out) #Final Output
        return out
In [ ]:
num_epochs = 1000 #1000 epochs
learning_rate = 0.001 #0.001 lr

input_size = 3 #number of features
hidden_size = 2 #number of features in hidden state
num_layers = 1 #number of stacked lstm layers

num_classes = 1 #number of output classes
In [ ]:
lstm1 = LSTM1(num_classes, input_size, hidden_size, num_layers, X_train_tensors_final.shape[1]) #our lstm class
In [ ]:
criterion = torch.nn.MSELoss()    # mean-squared error for regression
optimizer = torch.optim.Adam(lstm1.parameters(), lr=learning_rate)
In [ ]:
for epoch in range(num_epochs):
  outputs = lstm1.forward(X_train_tensors_final) #forward pass
  optimizer.zero_grad() #caluclate the gradient, manually setting to 0
 
  # obtain the loss function
  loss = criterion(outputs, y_train_tensors)
 
  loss.backward() #calculates the loss of the loss function
 
  optimizer.step() #improve from loss, i.e backprop
  if epoch % 100 == 0:
    print("Epoch: %d, loss: %1.5f" % (epoch, loss.item()))
Epoch: 0, loss: 0.65721
Epoch: 100, loss: 0.00534
Epoch: 200, loss: 0.00478
Epoch: 300, loss: 0.00356
Epoch: 400, loss: 0.00230
Epoch: 500, loss: 0.00166
Epoch: 600, loss: 0.00130
Epoch: 700, loss: 0.00113
Epoch: 800, loss: 0.00099
Epoch: 900, loss: 0.00085
In [ ]:
df_X_ss = ss.transform(input_data.iloc[:, :-1]) #old transformers
df_y_mm = mm.transform(input_data.iloc[:, -1:]) #old transformers

df_X_ss = Variable(torch.Tensor(df_X_ss)) #converting to Tensors
df_y_mm = Variable(torch.Tensor(df_y_mm))
#reshaping the dataset
df_X_ss = torch.reshape(df_X_ss, (df_X_ss.shape[0], 1, df_X_ss.shape[1]))
In [ ]:
train_predict = lstm1(df_X_ss)#forward pass
data_predict = train_predict.data.numpy() #numpy conversion
dataY_plot = df_y_mm.data.numpy()

data_predict = mm.inverse_transform(data_predict) #reverse transformation
dataY_plot = mm.inverse_transform(dataY_plot)
plt.figure(figsize=(40,3)) #plotting

plt.plot(dataY_plot, label='Actuall Data', c = 'b') #actual plot
plt.plot(range(550, 550+len(data_predict[550:])),data_predict[550:], label='Predicted Data',c='orange') #predicted plot
plt.title('LSTM model for Energy of ETF Type')
plt.xlabel('Report Date')
plt.ylabel('Flow')
plt.legend()
plt.show()

Multi-Class Classification

In [ ]:
# ETF (Exchange Traded Funds - insitutional investors)
ETF = pd.read_csv('https://raw.githubusercontent.com/yinghaow525/MSBA-Projects/main/Capstone%20Project/Datasets/US%20Sector%20Inst%20ETF.csv')
# Institutional Mutual Fund Holdings 
IMF = pd.read_csv('https://raw.githubusercontent.com/yinghaow525/MSBA-Projects/main/Capstone%20Project/Datasets/US%20Sector%20Institutional%20MF.csv')
# Retail Mutual Fund Holdings (investments made by individuals in their portfolios
RMF = pd.read_csv('https://raw.githubusercontent.com/yinghaow525/MSBA-Projects/main/Capstone%20Project/Datasets/US%20Sector%20Retail%20MF.csv')
In [ ]:
ETF.head(3)
Out[ ]:
ReportDate AssetClass Flow FlowPct AssetsEnd PortfolioChangePct
0 2/1/2017 12:00:00 AM Commodities/Materials-North America-USA-North ... 378.578706 4.5064 8679.056347 -1.1938
1 2/1/2017 12:00:00 AM Consumer Goods-North America-USA-North America... 332.526792 1.1479 28973.613065 -1.1231
2 2/1/2017 12:00:00 AM Energy-North America-USA-North America-Equity 186.031374 0.3782 48446.700077 -1.8855
In [ ]:
# Create type colunm
ETF['Type'] = 'ETF'
IMF['Type'] = 'IMF'
RMF['Type'] = 'RMF'

# Concat 3 dadaset
data = pd.concat([ETF, IMF, RMF], axis=0)
In [ ]:
# convert report date to pandas datetime object
data['ReportDate'] = pd.to_datetime(data['ReportDate'])
In [ ]:
# since all the Asset and Sector values are the same (Equity, and within US)
# we only keep Industry sector as features
data['AssetClass'] = data['AssetClass'].str.split('-')
data['Industry'] = data.AssetClass.apply(lambda x: x[0])

# drop orignal AssetClass feature
data.drop(columns='AssetClass', inplace=True)

# types and number of Industry values
data.Industry.value_counts()
Out[ ]:
Mid Cap Growth           1737
Technology               1737
Energy                   1737
Financials               1737
Health Care/Biotech      1737
Utilities                1737
Large Cap Blend          1737
Large Cap Growth         1737
Large Cap Value          1737
Mid Cap Blend            1737
Telecom                  1737
Mid Cap Value            1737
Real Estate              1737
Small Cap Blend          1737
Small Cap Growth         1737
Small Cap Value          1737
Commodities/Materials    1700
Consumer Goods           1700
Industrials              1435
Infrastructure             61
Name: Industry, dtype: int64
In [ ]:
data.head(3)
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry
0 2017-02-01 378.578706 4.5064 8679.056347 -1.1938 ETF Commodities/Materials
1 2017-02-01 332.526792 1.1479 28973.613065 -1.1231 ETF Consumer Goods
2 2017-02-01 186.031374 0.3782 48446.700077 -1.8855 ETF Energy
In [ ]:
data = data.sort_values(by=['ReportDate'])
In [ ]:
# extract year & month 
data['year'], data['month'] = data['ReportDate'].dt.year, data['ReportDate'].dt.month
data.head(12)
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry year month
10667 2006-01-04 -9.076384 -0.2789 3293.597159 1.5210 RMF Utilities 2006 1
10989 2006-01-04 5.144223 2.4847 213.485252 0.6311 ETF Industrials 2006 1
10988 2006-01-04 45.038766 1.7847 2601.925992 1.3221 ETF Health Care/Biotech 2006 1
10987 2006-01-04 0.000000 0.0000 501.334569 0.8235 ETF Financials 2006 1
10986 2006-01-04 12.115391 0.7262 1756.771759 4.5757 ETF Energy 2006 1
10985 2006-01-04 11.761774 1.8150 662.337236 0.3959 ETF Consumer Goods 2006 1
10984 2006-01-04 23.148567 6.8877 365.747363 1.9393 ETF Commodities/Materials 2006 1
10666 2006-01-04 -0.065491 -0.3344 19.953725 2.2451 RMF Telecom 2006 1
11015 2006-01-04 2.025772 0.4645 442.631215 1.0439 IMF Telecom 2006 1
11014 2006-01-04 -866.881534 -35.0908 1615.664999 0.4920 IMF Technology 2006 1
11013 2006-01-04 64.864213 0.8420 7863.256973 1.2367 IMF Small Cap Value 2006 1
11012 2006-01-04 -63.656249 -0.5549 11579.233991 1.4958 IMF Small Cap Growth 2006 1
In [ ]:
data.head(3)
Out[ ]:
ReportDate Flow FlowPct AssetsEnd PortfolioChangePct Type Industry year month
10667 2006-01-04 -9.076384 -0.2789 3293.597159 1.5210 RMF Utilities 2006 1
10989 2006-01-04 5.144223 2.4847 213.485252 0.6311 ETF Industrials 2006 1
10988 2006-01-04 45.038766 1.7847 2601.925992 1.3221 ETF Health Care/Biotech 2006 1
In [ ]:
data.shape
Out[ ]:
(32688, 9)
In [ ]:
data = data.sort_values(by=['Type', 'Industry','year','month','ReportDate'])
data = data.reset_index(drop=True)
In [ ]:
# 删除符合条件的指定行,并替换原始df
data.drop(data[(data.year == 2017) & (data.month == 2)].index, inplace=True)
data.drop(data[data.Industry == 'Infrastructure'].index, inplace=True)

data.drop(data[(data.year == 2006) & (data.month == 9) & (data.Industry == 'Commodities/Materials') & (data.Type == 'RMF')].index, inplace=True)
data.drop(data[(data.year == 2006) & (data.month == 9) & (data.Industry == 'Consumer Goods') & (data.Type == 'RMF')].index, inplace=True)
data.drop(data[(data.year == 2011) & (data.month == 10) & (data.Industry == 'Industrials') & (data.Type == 'RMF')].index, inplace=True)
In [ ]:
data.shape
Out[ ]:
(32564, 9)
In [ ]:
def converter(x):
    return pd.Series({
        'Flow':sum(x.Flow.unique()),
        'FlowPct1':x.FlowPct[:1].unique()[0],
        'FlowPct4':x.FlowPct.unique()[-1],
        'AssetsEnd1':x.AssetsEnd[:1].unique()[0],
        'AssetsEnd4':x.AssetsEnd.unique()[-1],
        'PortfolioChangePct1':x.PortfolioChangePct[:1].unique()[0],
          #  'PortfolioChangePct2':x.PortfolioChangePct.unique()[:, 1],
        'PortfolioChangePct4':x.PortfolioChangePct.unique()[-1],
        'Type':x.Type[:1].unique()[0],
        'Industry':x.Industry[:1].unique()[0],
        'year':x.year[:1].unique()[0],
        'month':x.month[:1].unique()[0]
        }
    )

# def converter(x):
#     return pd.Series({
#         'Flow':sum(x.Flow.unique()),
#         'FlowPct1':x.FlowPct.unique()[0],
#         'FlowPct2':x.FlowPct.unique()[1],
#         'FlowPct3':x.FlowPct.unique()[2],
#         'FlowPct4':x.FlowPct.unique()[-1],
      
#         'AssetsEnd1':x.AssetsEnd[:3].unique()[0],
#         'AssetsEnd2':x.AssetsEnd[:3].unique()[1],
#         'AssetsEnd3':x.AssetsEnd[:3].unique()[2],
#         'AssetsEnd4':x.AssetsEnd.unique()[-1],
#         'PortfolioChangePct1':x.PortfolioChangePct[:3].unique()[0],
#         'PortfolioChangePct2':x.PortfolioChangePct[:3].unique()[1],
#         'PortfolioChangePct3':x.PortfolioChangePct[:3].unique()[2],
#         'PortfolioChangePct4':x.PortfolioChangePct.unique()[-1],
#         'Type':x.Type[:1].unique()[0],
#         'Industry':x.Industry[:1].unique()[0],
#         'year':x.year[:1].unique()[0],
#         'month':x.month[:1].unique()[0]
#         }
#     )



group_filter = ['Industry','Type' , 'year', 'month']
data2=data.groupby(group_filter).apply(converter)
In [ ]:
data2['Flow'] = data2['Flow'].shift(-1)
In [ ]:
data2.rename(columns = {'Flow':'NextMonthFlow'}, inplace = True)
In [ ]:
data2
Out[ ]:
NextMonthFlow FlowPct1 FlowPct4 AssetsEnd1 AssetsEnd4 PortfolioChangePct1 PortfolioChangePct4 Type Industry year month
Industry Type year month
Commodities/Materials ETF 2006 1 39.448926 6.8877 -9.5029 365.747363 342.463192 1.9393 2.6315 ETF Commodities/Materials 2006 1
2 165.453872 0.0000 11.1578 353.551743 389.790653 3.2378 1.8497 ETF Commodities/Materials 2006 2
3 -134.278695 0.6584 4.0095 1226.034803 1426.387768 -0.7101 1.9107 ETF Commodities/Materials 2006 3
4 4.147121 -6.7927 -1.9407 1360.070894 1360.132199 2.1434 -0.8257 ETF Commodities/Materials 2006 4
5 299.481155 -5.8893 2.9431 1286.578357 1293.819947 0.4815 1.3907 ETF Commodities/Materials 2006 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Utilities RMF 2016 9 -138.551304 -0.3197 0.0272 14498.061035 14371.602536 2.9829 -0.6236 RMF Utilities 2016 9
10 -220.938402 -0.1423 -0.2206 13752.934773 13742.554326 -4.1624 -0.7032 RMF Utilities 2016 10
11 -195.533057 -0.2245 -0.2621 13317.818449 13178.058222 -2.8661 -0.0082 RMF Utilities 2016 11
12 -79.262456 -0.3005 -0.6127 13297.404282 13388.036351 1.2053 -0.5514 RMF Utilities 2016 12
2017 1 NaN -0.1206 -0.2070 13560.926194 13589.950808 1.4120 0.1333 RMF Utilities 2017 1

7493 rows × 11 columns

In [ ]:
d = data.groupby(group_filter,as_index=False,sort=False).aggregate(lambda x: ','.join(map(str, x)))
In [ ]:
for i in range(len(d)):
  if len(d.FlowPct[i].split(',')) < 4:
    print(i)
In [ ]:
d.FlowPct[135]
Out[ ]:
'1.2817,-1.9752,1.9783,4.0751,-1.0595'
In [ ]:
FlowPct2 = []
FlowPct3 = []
AssetsEnd2 = []
AssetsEnd3 = []
PortfolioChangePct2 = []
PortfolioChangePct3 = []

for i in range(len(d)):

  FlowPct2.append(float(d.FlowPct[i].split(',')[1]))
  FlowPct3.append(float(d.FlowPct[i].split(',')[2]))
  AssetsEnd2.append(float(d.AssetsEnd[i].split(',')[1]))
  AssetsEnd3.append(float(d.AssetsEnd[i].split(',')[2]))
  PortfolioChangePct2.append(float(d.PortfolioChangePct[i].split(',')[1]))
  PortfolioChangePct3.append(float(d.PortfolioChangePct[i].split(',')[2]))
In [ ]:
data2['FlowPct2'] = FlowPct2
data2['FlowPct3'] = FlowPct3
data2['AssetsEnd2'] = AssetsEnd2
data2['AssetsEnd3'] = AssetsEnd3
data2['PortfolioChangePct2'] = PortfolioChangePct2
data2['PortfolioChangePct3'] = PortfolioChangePct3
In [ ]:
data2
Out[ ]:
NextMonthFlow FlowPct1 FlowPct4 AssetsEnd1 AssetsEnd4 PortfolioChangePct1 PortfolioChangePct4 Type Industry year month FlowPct2 FlowPct3 AssetsEnd2 AssetsEnd3 PortfolioChangePct2 PortfolioChangePct3
Industry Type year month
Commodities/Materials ETF 2006 1 39.448926 6.8877 -9.5029 365.747363 342.463192 1.9393 2.6315 ETF Commodities/Materials 2006 1 3.1609 -0.6022 376.895214 367.731276 -0.1130 -1.8291
2 165.453872 0.0000 11.1578 353.551743 389.790653 3.2378 1.8497 ETF Commodities/Materials 2006 2 11.1578 0.0000 380.566108 382.711351 -3.5170 0.5636
3 -134.278695 0.6584 4.0095 1226.034803 1426.387768 -0.7101 1.9107 ETF Commodities/Materials 2006 3 -6.9428 -4.0455 1113.668457 1119.426080 -2.2222 4.5625
4 4.147121 -6.7927 -1.9407 1360.070894 1360.132199 2.1434 -0.8257 ETF Commodities/Materials 2006 4 -2.7356 2.0618 1307.920822 1398.830639 -1.0987 4.8888
5 299.481155 -5.8893 2.9431 1286.578357 1293.819947 0.4815 1.3907 ETF Commodities/Materials 2006 5 3.0092 -0.2588 1361.731888 1254.982939 2.8320 -7.5803
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Utilities RMF 2016 9 -138.551304 -0.3197 0.0272 14498.061035 14371.602536 2.9829 -0.6236 RMF Utilities 2016 9 -0.2149 -0.1768 13939.886858 14457.839909 -3.6350 3.8924
10 -220.938402 -0.1423 -0.2206 13752.934773 13742.554326 -4.1624 -0.7032 RMF Utilities 2016 10 -0.3798 -0.2580 13660.789697 13870.693964 -0.2901 1.7946
11 -195.533057 -0.2245 -0.2621 13317.818449 13178.058222 -2.8661 -0.0082 RMF Utilities 2016 11 -0.3609 -0.5072 13350.409748 13086.205981 0.6056 -1.4717
12 -79.262456 -0.3005 -0.6127 13297.404282 13388.036351 1.2053 -0.5514 RMF Utilities 2016 12 -0.2203 -0.3270 13336.608511 13545.731856 0.5152 1.8950
2017 1 NaN -0.1206 -0.2070 13560.926194 13589.950808 1.4120 0.1333 RMF Utilities 2017 1 -0.1404 -0.1177 13512.353881 13599.970052 -0.2177 0.7661

7493 rows × 17 columns

In [ ]:
data2 = data2[['FlowPct1','FlowPct2','FlowPct3','FlowPct4','AssetsEnd1','AssetsEnd2','AssetsEnd3','AssetsEnd4','PortfolioChangePct1','PortfolioChangePct2','PortfolioChangePct3','PortfolioChangePct4','Type','Industry','year','month','NextMonthFlow']]
data2
Out[ ]:
FlowPct1 FlowPct2 FlowPct3 FlowPct4 AssetsEnd1 AssetsEnd2 AssetsEnd3 AssetsEnd4 PortfolioChangePct1 PortfolioChangePct2 PortfolioChangePct3 PortfolioChangePct4 Type Industry year month NextMonthFlow
Industry Type year month
Commodities/Materials ETF 2006 1 6.8877 3.1609 -0.6022 -9.5029 365.747363 376.895214 367.731276 342.463192 1.9393 -0.1130 -1.8291 2.6315 ETF Commodities/Materials 2006 1 39.448926
2 0.0000 11.1578 0.0000 11.1578 353.551743 380.566108 382.711351 389.790653 3.2378 -3.5170 0.5636 1.8497 ETF Commodities/Materials 2006 2 165.453872
3 0.6584 -6.9428 -4.0455 4.0095 1226.034803 1113.668457 1119.426080 1426.387768 -0.7101 -2.2222 4.5625 1.9107 ETF Commodities/Materials 2006 3 -134.278695
4 -6.7927 -2.7356 2.0618 -1.9407 1360.070894 1307.920822 1398.830639 1360.132199 2.1434 -1.0987 4.8888 -0.8257 ETF Commodities/Materials 2006 4 4.147121
5 -5.8893 3.0092 -0.2588 2.9431 1286.578357 1361.731888 1254.982939 1293.819947 0.4815 2.8320 -7.5803 1.3907 ETF Commodities/Materials 2006 5 299.481155
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Utilities RMF 2016 9 -0.3197 -0.2149 -0.1768 0.0272 14498.061035 13939.886858 14457.839909 14371.602536 2.9829 -3.6350 3.8924 -0.6236 RMF Utilities 2016 9 -138.551304
10 -0.1423 -0.3798 -0.2580 -0.2206 13752.934773 13660.789697 13870.693964 13742.554326 -4.1624 -0.2901 1.7946 -0.7032 RMF Utilities 2016 10 -220.938402
11 -0.2245 -0.3609 -0.5072 -0.2621 13317.818449 13350.409748 13086.205981 13178.058222 -2.8661 0.6056 -1.4717 -0.0082 RMF Utilities 2016 11 -195.533057
12 -0.3005 -0.2203 -0.3270 -0.6127 13297.404282 13336.608511 13545.731856 13388.036351 1.2053 0.5152 1.8950 -0.5514 RMF Utilities 2016 12 -79.262456
2017 1 -0.1206 -0.1404 -0.1177 -0.2070 13560.926194 13512.353881 13599.970052 13589.950808 1.4120 -0.2177 0.7661 0.1333 RMF Utilities 2017 1 NaN

7493 rows × 17 columns

In [ ]:
data2['Inflow'] = np.where((data2.AssetsEnd4 - data2.AssetsEnd1) > 0, int(1), int(0))
In [ ]:
data2.drop(data2.tail(1).index,inplace=True)
In [ ]:
# Check the correlation of the features 
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15,10))
corr = data2.corr()
sns.heatmap(corr,linewidths=.2, cmap="YlGnBu", 
                 annot=True,
                 fmt=".2f",
                 annot_kws={'size':8,'weight':'normal', 'color':'#253D24'})
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc3f9b02150>
In [ ]:
import matplotlib.pyplot as plt
In [ ]:
data2 = pd.get_dummies(data2)
In [ ]:
data2 = data2.reset_index(drop=True)
data2
Out[ ]:
FlowPct1 FlowPct2 FlowPct3 FlowPct4 AssetsEnd1 AssetsEnd2 AssetsEnd3 AssetsEnd4 PortfolioChangePct1 PortfolioChangePct2 ... Industry_Mid Cap Blend Industry_Mid Cap Growth Industry_Mid Cap Value Industry_Real Estate Industry_Small Cap Blend Industry_Small Cap Growth Industry_Small Cap Value Industry_Technology Industry_Telecom Industry_Utilities
0 6.8877 3.1609 -0.6022 -9.5029 365.747363 376.895214 367.731276 342.463192 1.9393 -0.1130 ... 0 0 0 0 0 0 0 0 0 0
1 0.0000 11.1578 0.0000 11.1578 353.551743 380.566108 382.711351 389.790653 3.2378 -3.5170 ... 0 0 0 0 0 0 0 0 0 0
2 0.6584 -6.9428 -4.0455 4.0095 1226.034803 1113.668457 1119.426080 1426.387768 -0.7101 -2.2222 ... 0 0 0 0 0 0 0 0 0 0
3 -6.7927 -2.7356 2.0618 -1.9407 1360.070894 1307.920822 1398.830639 1360.132199 2.1434 -1.0987 ... 0 0 0 0 0 0 0 0 0 0
4 -5.8893 3.0092 -0.2588 2.9431 1286.578357 1361.731888 1254.982939 1293.819947 0.4815 2.8320 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7487 -0.3026 -0.0732 -0.4560 -0.1859 14739.462527 14597.096789 14473.057879 14121.270467 -0.1562 -0.8926 ... 0 0 0 0 0 0 0 0 0 1
7488 -0.3197 -0.2149 -0.1768 0.0272 14498.061035 13939.886858 14457.839909 14371.602536 2.9829 -3.6350 ... 0 0 0 0 0 0 0 0 0 1
7489 -0.1423 -0.3798 -0.2580 -0.2206 13752.934773 13660.789697 13870.693964 13742.554326 -4.1624 -0.2901 ... 0 0 0 0 0 0 0 0 0 1
7490 -0.2245 -0.3609 -0.5072 -0.2621 13317.818449 13350.409748 13086.205981 13178.058222 -2.8661 0.6056 ... 0 0 0 0 0 0 0 0 0 1
7491 -0.3005 -0.2203 -0.3270 -0.6127 13297.404282 13336.608511 13545.731856 13388.036351 1.2053 0.5152 ... 0 0 0 0 0 0 0 0 0 1

7492 rows × 38 columns

In [ ]:
# Check outliers
# Plot the boxplot od ratio features which are within the same scale
import matplotlib.pyplot as plt
plt.figure(figsize=(15,10))
data2.boxplot(column= [i for i in data2[['FlowPct1', 'FlowPct2','FlowPct3','FlowPct4', 
       'PortfolioChangePct1', 'PortfolioChangePct2', 'PortfolioChangePct3', 'PortfolioChangePct4']].columns.tolist()]);
# plt.ylim([-50, 100])

# 'AssetsEnd1', 'AssetsEnd4',
In [ ]:
# Winsorizing the ratioa variables at 1% and 99% values
from scipy.stats.mstats import winsorize

data2['FlowPct1'] = winsorize(data2['FlowPct1'], limits = [0.01,0.01])
data2['FlowPct2'] = winsorize(data2['FlowPct2'], limits = [0.01,0.01])
data2['FlowPct3'] = winsorize(data2['FlowPct3'], limits = [0.01,0.01])
data2['FlowPct4'] = winsorize(data2['FlowPct4'], limits = [0.01,0.01])
data2['PortfolioChangePct1'] = winsorize(data2['PortfolioChangePct1'], limits = [0.01,0.01])
data2['PortfolioChangePct2'] = winsorize(data2['PortfolioChangePct2'], limits = [0.01,0.01])
data2['PortfolioChangePct3'] = winsorize(data2['PortfolioChangePct3'], limits = [0.01,0.01])
data2['PortfolioChangePct4'] = winsorize(data2['PortfolioChangePct4'], limits = [0.01,0.01])
In [ ]:
# Check outliers
# Plot the boxplot od ratio features which are within the same scale
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
data2.boxplot(column= [i for i in data2[['AssetsEnd1', 'AssetsEnd2', 'AssetsEnd3', 'AssetsEnd4']].columns.tolist()]);
# plt.ylim([-50, 100])
In [ ]:
# There are many outliers and the data is right-skewed.
plt.figure(figsize=(10,6))
sns.distplot(data2['AssetsEnd1'], bins = 15, kde = False)
fig = sns.distplot(data2['AssetsEnd2'], bins = 15, kde = False)
fig = sns.distplot(data2['AssetsEnd3'], bins = 15, kde = False)
fig = sns.distplot(data2['AssetsEnd4'], bins = 15, kde = False)
fig.legend(labels = ['AssetsEnd1','AssetsEnd2','AssetsEnd3','AssetsEnd4'])
Out[ ]:
<matplotlib.legend.Legend at 0x7fc3f9bc0450>
In [ ]:
# Here we need to use Log transformation to transform data to normal or close to normal
import numpy as np

AssetsEnd1_log = np.log(data2['AssetsEnd1'])
AssetsEnd1_log = np.log(data2['AssetsEnd2'])
AssetsEnd1_log = np.log(data2['AssetsEnd3'])
AssetsEnd4_log = np.log(data2['AssetsEnd4'])

data2['AssetsEnd1_log'] = AssetsEnd1_log
data2['AssetsEnd2_log'] = AssetsEnd1_log
data2['AssetsEnd3_log'] = AssetsEnd1_log
data2['AssetsEnd4_log'] = AssetsEnd4_log

data2 = data2.drop(['AssetsEnd1', 'AssetsEnd2','AssetsEnd3','AssetsEnd4'], axis=1)
In [ ]:
# There are many outliers and the data is right-skewed.
plt.figure(figsize=(10,6))
sns.distplot(data2['AssetsEnd1_log'], bins = 15, kde = False)
fig = sns.distplot(data2['AssetsEnd2_log'], bins = 15, kde = False)
fig = sns.distplot(data2['AssetsEnd3_log'], bins = 15, kde = False)
fig = sns.distplot(data2['AssetsEnd4_log'], bins = 15, kde = False)
fig.legend(labels = ['AssetsEnd1_log','AssetsEnd2_log','AssetsEnd3_log','AssetsEnd4_log'])
Out[ ]:
<matplotlib.legend.Legend at 0x7fc3f9c39ed0>
In [ ]:
data2.shape
Out[ ]:
(7492, 38)
In [ ]:
label = data2[['NextMonthFlow']]
data2 = data2.drop(['NextMonthFlow'],axis=1)
In [ ]:
import seaborn as sns

sns.histplot(label['NextMonthFlow'])
plt.ylim((0,600))
Out[ ]:
(0.0, 600.0)
In [ ]:
classes = []

for i in label['NextMonthFlow'].values: 
  if i < -100 :
    classes.append('-1')
  elif i >=-100 and i<=100:
    classes.append('0')
  elif i >100:
    classes.append('1')
In [ ]:
label['classes'] = classes
In [ ]:
label
Out[ ]:
NextMonthFlow classes
0 39.448926 0
1 165.453872 1
2 -134.278695 -1
3 4.147121 0
4 299.481155 1
... ... ...
7487 -97.042509 0
7488 -138.551304 -1
7489 -220.938402 -1
7490 -195.533057 -1
7491 -79.262456 0

7492 rows × 2 columns

In [ ]:
print('The number of Flow located within (,-100) is'.format(), len(label.loc[label['NextMonthFlow']<-100] == True))
print('The number of Flow located within [-100,100] is'.format(),len(label.loc[(label['NextMonthFlow']>=-100) & (label['NextMonthFlow']<=100)] == True))
print('The number of Flow located within (100,) is'.format(),len(label.loc[(label['NextMonthFlow']>100) & (label['NextMonthFlow']<25000)] == True))
The number of Flow located within (,-100) is 2589
The number of Flow located within [-100,100] is 2288
The number of Flow located within (100,) is 2615
In [ ]:
pd.value_counts(label['classes'])
Out[ ]:
1     2615
-1    2589
0     2288
Name: classes, dtype: int64
In [ ]:
# Check the correlation of the features 
import seaborn as sns

plt.figure(figsize=(20,15))
corr = data2.corr()
sns.heatmap(corr,linewidths=.2, cmap="YlGnBu", 
                 annot=True,
                 fmt=".2f",
                 annot_kws={'size':8,'weight':'normal', 'color':'#253D24'})
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc3f9d6bfd0>
In [ ]:
def df_norm(df, cols):
    df_n = df[cols]
    for col in cols:
        ma = df_n[col].max()
        mi = df_n[col].min()
        df_n[str(col) + '_n'] = (df_n[col] - mi) / (ma - mi)
    return df_n
In [ ]:
data2['year'] = 2022 - data2['year']
In [ ]:
data2
Out[ ]:
FlowPct1 FlowPct2 FlowPct3 FlowPct4 PortfolioChangePct1 PortfolioChangePct2 PortfolioChangePct3 PortfolioChangePct4 year month ... Industry_Small Cap Blend Industry_Small Cap Growth Industry_Small Cap Value Industry_Technology Industry_Telecom Industry_Utilities AssetsEnd1_log AssetsEnd2_log AssetsEnd3_log AssetsEnd4_log
0 6.8877 3.1609 -0.6022 -9.5029 1.9393 -0.1130 -1.8291 2.6315 16 1 ... 0 0 0 0 0 0 5.907352 5.907352 5.907352 5.836164
1 0.0000 11.1578 0.0000 11.1578 3.2378 -3.5170 0.5636 1.8497 16 2 ... 0 0 0 0 0 0 5.947281 5.947281 5.947281 5.965610
2 0.6584 -6.9428 -4.0455 4.0095 -0.7101 -2.2222 4.5625 1.9107 16 3 ... 0 0 0 0 0 0 7.020571 7.020571 7.020571 7.262900
3 -6.7927 -2.7356 2.0618 -1.9407 2.1434 -1.0987 4.8888 -0.8257 16 4 ... 0 0 0 0 0 0 7.243392 7.243392 7.243392 7.215337
4 -5.8893 3.0092 -0.2588 2.9431 0.4815 2.8320 -7.5803 1.3907 16 5 ... 0 0 0 0 0 0 7.134877 7.134877 7.134877 7.165354
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7487 -0.3026 -0.0732 -0.4560 -0.1859 -0.1562 -0.8926 -0.3937 -1.6336 6 8 ... 0 0 0 0 0 1 9.580044 9.580044 9.580044 9.555437
7488 -0.3197 -0.2149 -0.1768 0.0272 2.9829 -3.6350 3.8924 -0.6236 6 9 ... 0 0 0 0 0 1 9.578992 9.578992 9.578992 9.573009
7489 -0.1423 -0.3798 -0.2580 -0.2206 -4.1624 -0.2901 1.7946 -0.7032 6 10 ... 0 0 0 0 0 1 9.537534 9.537534 9.537534 9.528252
7490 -0.2245 -0.3609 -0.5072 -0.2621 -2.8661 0.6056 -1.4717 -0.0082 6 11 ... 0 0 0 0 0 1 9.479314 9.479314 9.479314 9.486308
7491 -0.3005 -0.2203 -0.3270 -0.6127 1.2053 0.5152 1.8950 -0.5514 6 12 ... 0 0 0 0 0 1 9.513827 9.513827 9.513827 9.502117

7492 rows × 37 columns

In [ ]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(np.array(data2), np.array(label['classes']), test_size=0.2, random_state=888)
In [ ]:
models=[KNeighborsClassifier(),
        LogisticRegression(penalty='l2'),
        RandomForestClassifier(n_estimators=15),
        tree.DecisionTreeClassifier(),
        GradientBoostingClassifier(n_estimators=1000),
        AdaBoostClassifier(),
        GaussianNB(),
        LinearDiscriminantAnalysis(),
        QuadraticDiscriminantAnalysis(),
        SVC(kernel='rbf', probability=True),
        ]
        
models_str=['KNN',
            'LogisticRegression',
            'RandomForest',
            'DecisionTree',
            'GBDT',
            'AdaBoost',
            'GaussianNB',
            'LinearDiscriminantAnalysis',
            'QuadraticDiscriminantAnalysis',
            'SVM']
score_=[]
In [ ]:
for name,model in zip(models_str,models):
    print('Start Training:'+name)
    model=model
    model.fit(Xtrain,ytrain)
    y_pred=model.predict(Xtest)
    # pre_y_train = clf.predict(X_train)
    pre_y_test = model.predict(Xtest)
    # print("lr Metrics : {0}".format(precision_recall_fscore_support(ytest, pre_y_test)))  
    # score=model.score(Xtest,ytest)
    # score_.append(str(score)[:5])
    # print(name +'Score:'+str(score))


    from sklearn.metrics import classification_report
    from sklearn import metrics

    true_false = (pre_y_test == ytest)
    accuracy = np.count_nonzero(true_false) / float(len(ytest))
    print()
    print("accuracy is %f" % accuracy)

    # precision    recall  f1-score
    print()
    print(metrics.classification_report(ytest, pre_y_test))

    # 混淆矩阵
    print("Confusion Matrix...")
    print(metrics.confusion_matrix(ytest, pre_y_test))
Start Training:KNN

accuracy is 0.521014

              precision    recall  f1-score   support

          -1       0.48      0.64      0.55       507
           0       0.57      0.53      0.55       459
           1       0.53      0.39      0.45       533

    accuracy                           0.52      1499
   macro avg       0.53      0.52      0.52      1499
weighted avg       0.53      0.52      0.52      1499

Confusion Matrix...
[[327  71 109]
 [141 244  74]
 [213 110 210]]
Start Training:LogisticRegression

accuracy is 0.581721

              precision    recall  f1-score   support

          -1       0.55      0.51      0.53       507
           0       0.68      0.65      0.67       459
           1       0.53      0.59      0.56       533

    accuracy                           0.58      1499
   macro avg       0.59      0.58      0.59      1499
weighted avg       0.58      0.58      0.58      1499

Confusion Matrix...
[[261  59 187]
 [ 75 297  87]
 [141  78 314]]
Start Training:RandomForest

accuracy is 0.604403

              precision    recall  f1-score   support

          -1       0.56      0.63      0.59       507
           0       0.70      0.60      0.65       459
           1       0.58      0.59      0.58       533

    accuracy                           0.60      1499
   macro avg       0.61      0.60      0.61      1499
weighted avg       0.61      0.60      0.61      1499

Confusion Matrix...
[[317  46 144]
 [ 97 277  85]
 [148  73 312]]
Start Training:DecisionTree

accuracy is 0.577718

              precision    recall  f1-score   support

          -1       0.55      0.58      0.57       507
           0       0.63      0.61      0.62       459
           1       0.56      0.54      0.55       533

    accuracy                           0.58      1499
   macro avg       0.58      0.58      0.58      1499
weighted avg       0.58      0.58      0.58      1499

Confusion Matrix...
[[296  69 142]
 [ 89 280  90]
 [150  93 290]]
Start Training:GBDT

accuracy is 0.672448

              precision    recall  f1-score   support

          -1       0.65      0.65      0.65       507
           0       0.74      0.68      0.71       459
           1       0.64      0.68      0.66       533

    accuracy                           0.67      1499
   macro avg       0.68      0.67      0.67      1499
weighted avg       0.67      0.67      0.67      1499

Confusion Matrix...
[[331  46 130]
 [ 74 314  71]
 [108  62 363]]
Start Training:AdaBoost

accuracy is 0.583055

              precision    recall  f1-score   support

          -1       0.55      0.55      0.55       507
           0       0.67      0.64      0.65       459
           1       0.54      0.57      0.56       533

    accuracy                           0.58      1499
   macro avg       0.59      0.58      0.59      1499
weighted avg       0.59      0.58      0.58      1499

Confusion Matrix...
[[277  59 171]
 [ 82 292  85]
 [144  84 305]]
Start Training:GaussianNB

accuracy is 0.512342

              precision    recall  f1-score   support

          -1       0.49      0.56      0.52       507
           0       0.52      0.69      0.59       459
           1       0.54      0.32      0.40       533

    accuracy                           0.51      1499
   macro avg       0.52      0.52      0.50      1499
weighted avg       0.52      0.51      0.50      1499

Confusion Matrix...
[[282 131  94]
 [ 95 315  49]
 [199 163 171]]
Start Training:LinearDiscriminantAnalysis

accuracy is 0.586391

              precision    recall  f1-score   support

          -1       0.55      0.53      0.54       507
           0       0.75      0.57      0.65       459
           1       0.53      0.65      0.58       533

    accuracy                           0.59      1499
   macro avg       0.61      0.58      0.59      1499
weighted avg       0.60      0.59      0.59      1499

Confusion Matrix...
[[270  37 200]
 [ 83 263 113]
 [135  52 346]]
Start Training:QuadraticDiscriminantAnalysis

accuracy is 0.447632

              precision    recall  f1-score   support

          -1       0.47      0.38      0.42       507
           0       0.43      0.79      0.56       459
           1       0.45      0.21      0.29       533

    accuracy                           0.45      1499
   macro avg       0.45      0.46      0.42      1499
weighted avg       0.45      0.45      0.42      1499

Confusion Matrix...
[[195 207 105]
 [ 62 363  34]
 [154 266 113]]
Start Training:SVM

accuracy is 0.587058

              precision    recall  f1-score   support

          -1       0.60      0.43      0.50       507
           0       0.72      0.60      0.66       459
           1       0.51      0.72      0.60       533

    accuracy                           0.59      1499
   macro avg       0.61      0.59      0.59      1499
weighted avg       0.61      0.59      0.58      1499

Confusion Matrix...
[[218  43 246]
 [ 63 276 120]
 [ 85  62 386]]